The Brain Language Metrics on Company Filings (BLMCF) dataset monitors several language metrics on 10-K and 10-Q company reports for more than 6,000 US stocks.
Recent literature suggests that the market responds inefficiently to the information in company filings because of the increased complexity and length of such reports.
See for example:
- “Lazy Prices”, Cohen et al., 2018
- “The Positive Similarity of Company Filings and the Cross-Section of Stock Returns”, M. Padysak, 2020
- “How to Use Lexical Density of Company Filings”, D. Hanicova et al., 2021
This dataset contains historical data from January 2010 onward and live data updated daily by 12pm UTC.
DATASET STRUCTURE AND KEY FIELDS
The dataset consists of a single schema, "LANGUAGE_METRICS_COMPANY_FILINGS", and can be logically divided into two parts.
The first part contains the language metrics of the most recent 10-K or 10-Q report for each firm. It is stored in the tables "METRICS_10K" (metrics for 10-K reports) and "METRICS_ALL" (metrics for 10-K and 10-Q reports).
The key metrics are:
1. Financial sentiment (field SENTIMENT)
2. Percentage of words belonging to the financial domain, classified by language type:
- “Constraining” language (field SCORE_CONSTRAINING)
- “Interesting” language (field SCORE_INTERESTING)
- “Litigious” language (field SCORE_LITIGIOUS)
- “Uncertainty” language (field SCORE_UNCERTAINTY)
3. Readability score (field READABILITY)
4. Lexical metrics such as lexical richness and lexical density (fields LEXICAL_RICHNESS and LEXICAL_DENSITY)
5. Text statistics such as the report length and the average sentence length (fields N_SENTENCES and MEAN_SENTENCE_LENGTH)
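As a minimal sketch of how the fields listed above might be queried, the following Python snippet builds a toy in-memory SQLite table named METRICS_ALL and ranks firms by SCORE_UNCERTAINTY. The TICKER column, the helper name most_uncertain and all values are hypothetical and used only for illustration; only the table name and the metric field names come from the description above, and the real table may be hosted on a different database engine with additional columns.

```python
import sqlite3

# Toy in-memory stand-in for the METRICS_ALL table.
# TICKER and every value below are hypothetical; only the table name
# and the metric field names are taken from the dataset description.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE METRICS_ALL ("
    "TICKER TEXT, SENTIMENT REAL, SCORE_UNCERTAINTY REAL, "
    "READABILITY REAL, N_SENTENCES INTEGER)"
)
conn.executemany(
    "INSERT INTO METRICS_ALL VALUES (?, ?, ?, ?, ?)",
    [
        ("AAA", 0.12, 0.031, 18.5, 2400),
        ("BBB", -0.05, 0.047, 21.0, 3900),
        ("CCC", 0.02, 0.026, 17.2, 1800),
    ],
)


def most_uncertain(connection, limit=2):
    """Return (TICKER, SCORE_UNCERTAINTY) rows, highest uncertainty first."""
    query = (
        "SELECT TICKER, SCORE_UNCERTAINTY FROM METRICS_ALL "
        "ORDER BY SCORE_UNCERTAINTY DESC LIMIT ?"
    )
    return connection.execute(query, (limit,)).fetchall()


top = most_uncertain(conn)
print(top)
```

The same pattern applies to any of the other metric fields, for example ordering by READABILITY or filtering on SENTIMENT.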
The second part contains the differences between the two most recent 10-K or 10-Q reports of the same period for each company. It is stored in the tables "DIFFERENCES_10K" (differences of metrics for 10-K reports) and "DIFFERENCES_ALL" (differences of metrics for 10-K and 10-Q reports).
The key metrics are:
1. Differences of the various language metrics (e.g. delta sentiment, delta readability, delta percentage of a specific language type). See for example the field DELTA_SENTIMENT, which represents the difference in financial sentiment between the last available report and the previous report of the same period and category.
2. Similarity metrics between documents, also with respect to a specific language type (for example similarity with respect to “litigious” language or “uncertainty” language). See for example the field SIMILARITY_ALL, which represents the language similarity between the last available report and the previous report of the same period and category.
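The two kinds of difference metrics above can be illustrated with a small Python sketch. The description does not state how the similarity is computed, so the cosine similarity over word counts used here is an assumption chosen for illustration, not necessarily the provider's method. The function names, the optional vocabulary argument (a hypothetical word list standing in for a language type such as “litigious”) and the sample values are likewise hypothetical.

```python
import math
from collections import Counter


def delta_metric(latest_value, previous_value):
    """Difference of a metric: last available report minus previous report."""
    return latest_value - previous_value


def cosine_similarity(text_a, text_b, vocabulary=None):
    """Cosine similarity of word counts (an assumed similarity measure).

    If `vocabulary` is given, only words in that set are counted, which
    mimics a similarity restricted to one language type.
    """
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    if vocabulary is not None:
        counts_a = Counter({w: c for w, c in counts_a.items() if w in vocabulary})
        counts_b = Counter({w: c for w, c in counts_b.items() if w in vocabulary})
    # Dot product over the words shared by both documents.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # no countable words in at least one document
    return dot / (norm_a * norm_b)


# Hypothetical sentiment values for two reports of the same period.
delta_sentiment = delta_metric(0.10, 0.25)

# Hypothetical report excerpts and a hypothetical "litigious" word list.
previous_text = "the company faces litigation and regulatory claims"
latest_text = "the company faces new litigation and market risks"
similarity_all = cosine_similarity(latest_text, previous_text)
similarity_litigious = cosine_similarity(
    latest_text, previous_text, vocabulary={"litigation", "claims"}
)
print(delta_sentiment, similarity_all, similarity_litigious)
```

A negative delta here means the latest report scores lower than the previous one, and a similarity close to 1 means the wording of the two reports is largely unchanged under this assumed measure.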
The dataset includes the metrics and related differences both for the whole report and for specific sections (Risk Factors and Management Discussion and Analysis).
FACTSHEET
Link to factsheet: https://braincompany.co/assets/files/BLM_CF_V2_summary.pdf
DISCLAIMER
The content of this dataset is not intended as investment advice. The material is provided for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation or endorsement for any security or strategy, nor does it constitute an offer to provide investment advisory or other services by Brain. Brain makes no guarantees regarding the accuracy and completeness of the information expressed in the dataset.