Description
Feature Category
- New API functionality
- Performance improvement
- Developer experience improvement
- Documentation enhancement
- Tool/utility addition
Problem Statement
Is your feature request related to a problem? Please describe.
EntityFacts.to_dataframe() deduplicates by period, keeping only the latest filing for each (concept, period_end) combination. For example, the Q1 FY2024 figures reported as comparatives in a Q1 FY2025 10-Q (filed 2025-05-02) overwrite the Q1 FY2024 figures from the original Q1 FY2024 10-Q (filed 2024-05-01).
For point-in-time (PIT) backtesting, you need to answer: "What data was publicly known at date T?" If the dedup always keeps the latest filing, you get lookahead bias, one of the most common mistakes in financial backtesting.
Example of the problem:
AAPL Revenue Q1 FY2024:
- Original filing: 10-Q filed 2024-02-02 → Revenue = $119.6B
- Re-reported in: 10-Q filed 2025-02-01 → Revenue = $119.6B (unchanged here, but comparatives are sometimes restated)
If you filter to_dataframe() for "as of 2024-06-01":
- Current behavior: Only the 2025-02-01 version exists → LOOKAHEAD BIAS
- PIT behavior: Both versions exist → filter filing_date <= 2024-06-01 → get 2024-02-02 version
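The effect can be reproduced on a toy table. This is a minimal sketch in plain pandas, not EdgarTools itself: the column names (`concept`, `period_end`, `filing_date`, `value`) are assumed for illustration, and the second value is a made-up restatement so the two versions are distinguishable.

```python
import pandas as pd

# Two filings reporting the same (concept, period_end); values are illustrative.
facts = pd.DataFrame(
    {
        "concept": ["Revenue", "Revenue"],
        "period_end": pd.to_datetime(["2023-12-30", "2023-12-30"]),
        "filing_date": pd.to_datetime(["2024-02-02", "2025-02-01"]),
        "value": [119.6, 119.9],  # second value is a hypothetical restatement
    }
)
as_of = pd.Timestamp("2024-06-01")

# Keep-latest dedup first, then filter: the original filing is already gone.
latest_only = facts.sort_values("filing_date").drop_duplicates(
    subset=["concept", "period_end"], keep="last"
)
biased = latest_only[latest_only["filing_date"] <= as_of]

# Filter first, then dedup: only what was public on the as-of date survives.
known = facts[facts["filing_date"] <= as_of]
pit = known.sort_values("filing_date").drop_duplicates(
    subset=["concept", "period_end"], keep="last"
)

print(len(biased))            # 0: the period vanishes from the as-of view
print(pit["value"].tolist())  # [119.6]: the originally filed value
```

Without the as-of filter the deduplicated table would silently return the later value instead, which is the lookahead case described above.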
Who would benefit from this feature?
- Beginner Python users working with SEC filings
- Financial analysts and researchers
- Advanced developers building financial applications
- Data scientists working with financial datasets
Proposed Solution
Describe the solution you'd like
Add a pit_mode parameter to to_dataframe() that preserves all filing versions:
class EntityFacts:
    def to_dataframe(
        self,
        include_metadata: bool = False,
        columns: list[str] | None = None,
        pit_mode: bool = False,  # NEW
    ) -> pd.DataFrame:
        """Export facts to DataFrame.

        Args:
            pit_mode: If True, include filing_date and preserve all filing
                versions (don't deduplicate by period). Enables point-in-time
                analysis by filtering: df[df['filing_date'] <= as_of_date].
        """

When pit_mode=True:
- Skip the period-based dedup step
- Include the filing_date column in the output
- Dedup key becomes (concept, period_start, period_end, filing_date), which removes exact duplicates but preserves filing versions
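The two dedup keys can be compared side by side. The following is only a sketch under stated assumptions: `dedup_facts` is a hypothetical stand-in for the dedup step, not the actual EdgarTools internals, and it assumes the facts are already in a DataFrame with `concept`, `period_start`, `period_end`, and `filing_date` columns and that the default output omits `filing_date`.

```python
import pandas as pd


def dedup_facts(df: pd.DataFrame, pit_mode: bool = False) -> pd.DataFrame:
    """Hypothetical stand-in for the dedup step inside to_dataframe()."""
    ordered = df.sort_values("filing_date")
    if pit_mode:
        # One row per filing version; only exact repeats are dropped.
        key = ["concept", "period_start", "period_end", "filing_date"]
        return ordered.drop_duplicates(subset=key, keep="last").reset_index(drop=True)
    # Default: one row per period, taken from the most recent filing.
    # Dropping filing_date here is an assumption about the current output.
    key = ["concept", "period_end"]
    return (
        ordered.drop_duplicates(subset=key, keep="last")
        .drop(columns="filing_date")
        .reset_index(drop=True)
    )


# Toy data: the original filing appears twice (exact duplicate), plus a later filing.
facts = pd.DataFrame(
    {
        "concept": ["Revenue"] * 3,
        "period_start": pd.to_datetime(["2023-10-01"] * 3),
        "period_end": pd.to_datetime(["2023-12-30"] * 3),
        "filing_date": pd.to_datetime(["2024-02-02", "2024-02-02", "2025-02-01"]),
        "value": [119.6, 119.6, 119.6],
    }
)

default_df = dedup_facts(facts)
pit_df = dedup_facts(facts, pit_mode=True)
print(len(default_df))  # 1
print(len(pit_df))      # 2
```

Including period_start in the PIT key also keeps facts with different durations but the same end date (for example quarterly and year-to-date values) from collapsing into one row.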
Describe alternatives you've considered
- Always preserve filing versions: Would change default behavior and increase DataFrame size. Not backward-compatible.
- Separate to_pit_dataframe() method: Possible but adds API surface. A parameter is simpler.
- Include filing_date by default: Minimal change but doesn't solve the dedup issue.
Use Case Example
How would you use this feature?
from edgar import Company
import pandas as pd

company = Company("AAPL")
ef = company.get_facts()

# Standard mode (current behavior, for latest-value analysis)
df_latest = ef.to_dataframe()

# PIT mode (for backtesting; preserves all filing versions)
df_pit = ef.to_dataframe(pit_mode=True)

# Simulate "what was known on 2024-06-01?"
as_of = pd.Timestamp("2024-06-01")
known_facts = df_pit[df_pit["filing_date"] <= as_of]

# Get the latest-known value for each concept/period combination
pit_latest = (
    known_facts
    .sort_values("filing_date")
    .drop_duplicates(subset=["concept", "period_end"], keep="last")
)

# Now pit_latest contains only data that was publicly available by 2024-06-01
# No lookahead bias!
revenue = pit_latest[pit_latest["concept"].str.contains("Revenue")]
print(f"Known revenue data points as of {as_of.date()}: {len(revenue)}")

Implementation Considerations
Complexity Level:
- Simple (minor API addition)
- Moderate (new functionality with existing patterns)
- Complex (significant architectural changes)
Backwards Compatibility:
- This feature maintains backwards compatibility
- This feature might break existing code (please explain below)
- Unsure about compatibility impact
The default pit_mode=False preserves current behavior exactly.
Additional Context
- PIT analysis is critical for backtesting, academic research, and compliance
- Lookahead bias is a leading source of invalid backtesting results in quantitative finance
- No new data is needed: filing_date is already available on FinancialFact, it just needs to be preserved through the export
- The implementation is essentially: skip one dedup step and add one column
Related Issues/Features:
- Useful in combination with quarterize() ([FEATURE] Public Quarterization API for TTMCalculator #692) for quarterly PIT time series
- The FinancialFact already has filing_date; this just needs to survive the to_dataframe() export
Feature requests are evaluated based on EdgarTools' core principles: Simple yet powerful, accurate financials, beginner-friendly, and joyful UX.