Skip to content

jamiekt/jstark

Repository files navigation

jstark

PyPI - Version PyPI - Python Version build coverage pylint

Built with uv


A PySpark library for generating time-based features for machine learning. All features are calculated relative to an as at date, enabling point-in-time feature engineering over configurable time periods.

Feature period mnemonics

Feature names end with a mnemonic describing the time window. The format is {start}{unit}{end} where the unit is one of d (days), w (weeks), m (months), q (quarters) or y (years).

For example, BasketCount_3m1 is the distinct count of baskets from 3 months before to 1 month before the as at date.

Multiple periods can be calculated in a single Spark job:

from datetime import date
from jstark.grocery import GroceryFeatures

gf = (
    GroceryFeatures()
    .with_as_at(date(2022, 1, 1))
    .with_feature_periods(["3m1", "6m4"])
)
output_df = input_df.groupBy("Store").agg(*gf.features)

This produces BasketCount_3m1, BasketCount_6m4, and every other feature for both periods. See the Features reference for a list of all available features.

Quick start

Prerequisites: Java runtime required for PySpark. On macOS: brew install openjdk@11.

pip install jstark[faker]

The faker extra installs Faker, which is needed for the sample data generator used below. If you don't need sample data, pip install jstark is sufficient.

from datetime import date
from jstark.sample.transactions import FakeGroceryTransactions
from jstark.grocery import GroceryFeatures

input_df = FakeGroceryTransactions().df
gf = (
    GroceryFeatures()
    .with_as_at(date(2022, 1, 1))
    .with_feature_periods(["4q4", "3q3", "2q2", "1q1"])
)
output_df = input_df.groupBy("Store").agg(*gf.features)
output_df.select(
    "Store", "BasketCount_4q4", "BasketCount_3q3", "BasketCount_2q2", "BasketCount_1q1"
).show()
+-----------+---------------+---------------+---------------+---------------+
|      Store|BasketCount_4q4|BasketCount_3q3|BasketCount_2q2|BasketCount_1q1|
+-----------+---------------+---------------+---------------+---------------+
|    Staines|             47|             46|             48|             51|
| Twickenham|             55|             57|             48|             49|
|     Ealing|             52|             51|             50|             54|
|Hammersmith|             47|             40|             43|             51|
|   Richmond|             54|             40|             64|             53|
+-----------+---------------+---------------+---------------+---------------+

Try the demo notebook

A runnable Jupyter notebook is bundled with the wheel:

pip install jstark[jupyter]
jstark-demo
jupyter notebook GroceryFeatures_demo.ipynb

jstark-demo copies GroceryFeatures_demo.ipynb into your current directory. Use jstark-demo --force to overwrite an existing copy.

Feature descriptions and references

Every feature carries a description in its column metadata:

from pprint import pprint
pprint([(c.name, c.metadata["description"]) for c in output_df.schema if c.name.endswith("1q1")])
[('BasketCount_1q1',
  'Distinct count of Baskets between 2021-10-01 and 2021-12-31'),
 ...]

You can also inspect what input columns each feature requires:

gf.references["BasketCount_1q1"]                   # ['Basket', 'Timestamp']
gf.references["CustomerCount_1q1"]                 # ['Customer', 'Timestamp']
gf.references["AvgGrossSpendPerBasket_1q1"]        # ['Basket', 'GrossSpend', 'Timestamp']

All features require a Timestamp column (TimestampType). Most require additional columns depending on what they measure.

Features reference

Grocery features

A list of all Grocery features available if one were to call:

GroceryFeatures(date(2026, 1, 1), ["3m1"])
Feature Description
ApproxBasketCount_3m1 Approximate distinct count of Baskets between 2021-10-01 and 2021-12-31
ApproxCustomerCount_3m1 Approximate distinct count of Customers between 2021-10-01 and 2021-12-31
ApproxProductCount_3m1 Approximate distinct count of Products between 2021-10-01 and 2021-12-31
AverageBasketsPerMonth_3m1 Average number of baskets per month between 2021-10-01 and 2021-12-31
AvgDiscountPerBasket_3m1 Average Discount per Basket between 2021-10-01 and 2021-12-31
AvgGrossSpendPerBasket_3m1 Average GrossSpend per Basket between 2021-10-01 and 2021-12-31
AvgPurchaseCycle_3m1 Average purchase cycle between 2021-10-01 and 2021-12-31
AvgQuantityPerBasket_3m1 Average Quantity per Basket between 2021-10-01 and 2021-12-31
BasketCount_3m1 Distinct count of Baskets between 2021-10-01 and 2021-12-31
BasketMonths_3m1 Number of months in which at least one basket was purchased between 2021-10-01 and 2021-12-31
ChannelCount_3m1 Distinct count of Channels between 2021-10-01 and 2021-12-31
Count_3m1 Count of rows between 2021-10-01 and 2021-12-31
CustomerCount_3m1 Distinct count of Customers between 2021-10-01 and 2021-12-31
CyclesSinceLastPurchase_3m1 Cycles since last purchase between 2021-10-01 and 2021-12-31
Discount_3m1 Sum of Discount between 2021-10-01 and 2021-12-31
EarliestPurchaseDate_3m1 Earliest purchase date between 2021-10-01 and 2021-12-31
GrossSpend_3m1 Sum of GrossSpend between 2021-10-01 and 2021-12-31
MaxGrossPrice_3m1 Maximum of (GrossSpend / Quantity) between 2021-10-01 and 2021-12-31
MaxGrossSpend_3m1 Maximum GrossSpend value between 2021-10-01 and 2021-12-31
MaxNetPrice_3m1 Maximum of (NetSpend / Quantity) between 2021-10-01 and 2021-12-31
MaxNetSpend_3m1 Maximum of NetSpend value between 2021-10-01 and 2021-12-31
MinGrossPrice_3m1 Minimum of (GrossSpend / Quantity) between 2021-10-01 and 2021-12-31
MinGrossSpend_3m1 Minimum GrossSpend value between 2021-10-01 and 2021-12-31
MinNetPrice_3m1 Minimum of (NetSpend / Quantity) between 2021-10-01 and 2021-12-31
MinNetSpend_3m1 Minimum of NetSpend value between 2021-10-01 and 2021-12-31
MostRecentPurchaseDate_3m1 Most recent purchase date between 2021-10-01 and 2021-12-31
NetSpend_3m1 Sum of NetSpend between 2021-10-01 and 2021-12-31
ProductCount_3m1 Distinct count of Products between 2021-10-01 and 2021-12-31
Quantity_3m1 Sum of Quantity between 2021-10-01 and 2021-12-31
RecencyDays_3m1 Minimum number of days since occurrence between 2021-10-01 and 2021-12-31
RecencyWeightedApproxBasketMonths90_3m1 Exponentially weighted moving average, with smoothing factor of 0.9, of the approximate number of baskets per month between 2021-10-01 and 2021-12-31
RecencyWeightedApproxBasketMonths95_3m1 Exponentially weighted moving average, with smoothing factor of 0.95, of the approximate number of baskets per month between 2021-10-01 and 2021-12-31
RecencyWeightedApproxBasketMonths99_3m1 Exponentially weighted moving average, with smoothing factor of 0.99, of the approximate number of baskets per month between 2021-10-01 and 2021-12-31
RecencyWeightedBasketMonths90_3m1 Exponentially weighted moving average, with smoothing factor of 0.9, of the number of baskets per month between 2021-10-01 and 2021-12-31
RecencyWeightedBasketMonths95_3m1 Exponentially weighted moving average, with smoothing factor of 0.95, of the number of baskets per month between 2021-10-01 and 2021-12-31
RecencyWeightedBasketMonths99_3m1 Exponentially weighted moving average, with smoothing factor of 0.99, of the number of baskets per month between 2021-10-01 and 2021-12-31
StoreCount_3m1 Distinct count of Stores between 2021-10-01 and 2021-12-31
Mealkit features

A list of all Mealkit features available if one were to call:

MealkitFeatures(date(2026, 1, 1), ["3m1"])
Feature Description
AllergenCount_3m1 Distinct count of Allergens between 2021-10-01 and 2021-12-31
Allergens_3m1 Set of Allergens between 2021-10-01 and 2021-12-31
ApproxCustomerCount_3m1 Approximate distinct count of Customers between 2021-10-01 and 2021-12-31
ApproxOrderCount_3m1 Approximate distinct count of Orders between 2021-10-01 and 2021-12-31
ApproxProductCount_3m1 Approximate distinct count of Products between 2021-10-01 and 2021-12-31
ApproxRecipeCount_3m1 Approximate distinct count of Recipes between 2021-10-01 and 2021-12-31
AverageOrdersPerMonth_3m1 Average number of orders per month between 2021-10-01 and 2021-12-31
AvgPurchaseCycle_3m1 Average purchase cycle between 2021-10-01 and 2021-12-31
AvgQuantityPerOrder_3m1 Average Quantity per Order between 2021-10-01 and 2021-12-31
Count_3m1 Count of rows between 2021-10-01 and 2021-12-31
CuisineCount_3m1 Distinct count of Cuisines between 2021-10-01 and 2021-12-31
Cuisines_3m1 Set of Cuisines between 2021-10-01 and 2021-12-31
CustomerCount_3m1 Distinct count of Customers between 2021-10-01 and 2021-12-31
CyclesSinceLastOrder_3m1 Cycles since last order between 2021-10-01 and 2021-12-31
Discount_3m1 Sum of Discount between 2021-10-01 and 2021-12-31
EarliestPurchaseDate_3m1 Earliest purchase date between 2021-10-01 and 2021-12-31
MostRecentPurchaseDate_3m1 Most recent purchase date between 2021-10-01 and 2021-12-31
OrderCount_3m1 Distinct count of Orders between 2021-10-01 and 2021-12-31
OrderMonths_3m1 Number of months in which at least one order was placed between 2021-10-01 and 2021-12-31
ProductCount_3m1 Distinct count of Products between 2021-10-01 and 2021-12-31
Quantity_3m1 Sum of Quantity between 2021-10-01 and 2021-12-31
RecencyDays_3m1 Minimum number of days since occurrence between 2021-10-01 and 2021-12-31
RecipeCount_3m1 Distinct count of Recipes between 2021-10-01 and 2021-12-31

The following features provide insights into different cuisines:

Feature Description
AfricanCuisineCount_3m1 Count of African recipes between 2021-10-01 and 2021-12-31
AmericanCuisineCount_3m1 Count of American recipes between 2021-10-01 and 2021-12-31
ArgentinianCuisineCount_3m1 Count of Argentinian recipes between 2021-10-01 and 2021-12-31
AsianCuisineCount_3m1 Count of Asian recipes between 2021-10-01 and 2021-12-31
AustralianCuisineCount_3m1 Count of Australian recipes between 2021-10-01 and 2021-12-31
AustrianCuisineCount_3m1 Count of Austrian recipes between 2021-10-01 and 2021-12-31
BelgianCuisineCount_3m1 Count of Belgian recipes between 2021-10-01 and 2021-12-31
BrazilianCuisineCount_3m1 Count of Brazilian recipes between 2021-10-01 and 2021-12-31
BritishCuisineCount_3m1 Count of British recipes between 2021-10-01 and 2021-12-31
BulgarianCuisineCount_3m1 Count of Bulgarian recipes between 2021-10-01 and 2021-12-31
CajunCuisineCount_3m1 Count of Cajun recipes between 2021-10-01 and 2021-12-31
CambodianCuisineCount_3m1 Count of Cambodian recipes between 2021-10-01 and 2021-12-31
CanadianCuisineCount_3m1 Count of Canadian recipes between 2021-10-01 and 2021-12-31
CaribbeanCuisineCount_3m1 Count of Caribbean recipes between 2021-10-01 and 2021-12-31
CentralAmericaCuisineCount_3m1 Count of Central america recipes between 2021-10-01 and 2021-12-31
CentralAsiaCuisineCount_3m1 Count of Centralasia recipes between 2021-10-01 and 2021-12-31
ChineseCuisineCount_3m1 Count of Chinese recipes between 2021-10-01 and 2021-12-31
CubanCuisineCount_3m1 Count of Cuban recipes between 2021-10-01 and 2021-12-31
DanishCuisineCount_3m1 Count of Danish recipes between 2021-10-01 and 2021-12-31
DutchCuisineCount_3m1 Count of Dutch recipes between 2021-10-01 and 2021-12-31
EastAfricanCuisineCount_3m1 Count of East african recipes between 2021-10-01 and 2021-12-31
EastAsiaCuisineCount_3m1 Count of East asia recipes between 2021-10-01 and 2021-12-31
EasteuropeanCuisineCount_3m1 Count of Easteuropean recipes between 2021-10-01 and 2021-12-31
EgyptianCuisineCount_3m1 Count of Egyptian recipes between 2021-10-01 and 2021-12-31
EuropeanCuisineCount_3m1 Count of European recipes between 2021-10-01 and 2021-12-31
FilipinoCuisineCount_3m1 Count of Filipino recipes between 2021-10-01 and 2021-12-31
FrenchCuisineCount_3m1 Count of French recipes between 2021-10-01 and 2021-12-31
FusionCuisineCount_3m1 Count of Fusion recipes between 2021-10-01 and 2021-12-31
FusionCuisineCusiineCount_3m1 Count of Fusion-cuisine recipes between 2021-10-01 and 2021-12-31
GeorgianCuisineCount_3m1 Count of Georgian recipes between 2021-10-01 and 2021-12-31
GermanCuisineCount_3m1 Count of German recipes between 2021-10-01 and 2021-12-31
GreekCuisineCount_3m1 Count of Greek recipes between 2021-10-01 and 2021-12-31
HawaiianCuisineCount_3m1 Count of Hawaiian recipes between 2021-10-01 and 2021-12-31
HungarianCuisineCount_3m1 Count of Hungarian recipes between 2021-10-01 and 2021-12-31
IndianCuisineCount_3m1 Count of Indian recipes between 2021-10-01 and 2021-12-31
IndonesianCuisineCount_3m1 Count of Indonesian recipes between 2021-10-01 and 2021-12-31
IranianCuisineCount_3m1 Count of Iranian recipes between 2021-10-01 and 2021-12-31
IrishCuisineCount_3m1 Count of Irish recipes between 2021-10-01 and 2021-12-31
IsrealiCuisineCount_3m1 Count of Israeli recipes between 2021-10-01 and 2021-12-31
ItalianCuisineCount_3m1 Count of Italian recipes between 2021-10-01 and 2021-12-31
JamaicanCuisineCount_3m1 Count of Jamaican recipes between 2021-10-01 and 2021-12-31
JapaneseCuisineCount_3m1 Count of Japanese recipes between 2021-10-01 and 2021-12-31
KoreanCuisineCount_3m1 Count of Korean recipes between 2021-10-01 and 2021-12-31
LatinAmericanCuisineCount_3m1 Count of Latin american recipes between 2021-10-01 and 2021-12-31
LatinCuisineCount_3m1 Count of Latin recipes between 2021-10-01 and 2021-12-31
LebaneseCuisineCount_3m1 Count of Lebanese recipes between 2021-10-01 and 2021-12-31
MalaysianCuisineCount_3m1 Count of Malay recipes between 2021-10-01 and 2021-12-31
MediterraneanCuisineCount_3m1 Count of Mediterranean recipes between 2021-10-01 and 2021-12-31
MexicanCuisineCount_3m1 Count of Mexican recipes between 2021-10-01 and 2021-12-31
MiddleEasternCuisineCount_3m1 Count of Middleeastern recipes between 2021-10-01 and 2021-12-31
MidwestCuisineCount_3m1 Count of Midwest recipes between 2021-10-01 and 2021-12-31
MongolianCuisineCount_3m1 Count of Mongolian recipes between 2021-10-01 and 2021-12-31
MoroccanCuisineCount_3m1 Count of Moroccan recipes between 2021-10-01 and 2021-12-31
NewZealandCuisineCount_3m1 Count of New zealand recipes between 2021-10-01 and 2021-12-31
NordicCuisineCount_3m1 Count of Nordic recipes between 2021-10-01 and 2021-12-31
NorthAfricanCuisineCount_3m1 Count of North african recipes between 2021-10-01 and 2021-12-31
NorthAmericanCuisineCount_3m1 Count of North american recipes between 2021-10-01 and 2021-12-31
NorthamericaCuisineCount_3m1 Count of Northamerica recipes between 2021-10-01 and 2021-12-31
NortheastCuisineCount_3m1 Count of Northeast recipes between 2021-10-01 and 2021-12-31
NorthernEuropeanCuisineCount_3m1 Count of Northern europe recipes between 2021-10-01 and 2021-12-31
PacificIslandsCuisineCount_3m1 Count of Pacific-islands recipes between 2021-10-01 and 2021-12-31
PacificislandsCuisineCount_3m1 Count of Pacificislands recipes between 2021-10-01 and 2021-12-31
PeruvianCuisineCount_3m1 Count of Peruvian recipes between 2021-10-01 and 2021-12-31
PortugueseCuisineCount_3m1 Count of Portuguese recipes between 2021-10-01 and 2021-12-31
RussianCuisineCount_3m1 Count of Russian recipes between 2021-10-01 and 2021-12-31
ScandinavianCuisineCount_3m1 Count of Scandinavian recipes between 2021-10-01 and 2021-12-31
SingaporeanCuisineCount_3m1 Count of Singaporean recipes between 2021-10-01 and 2021-12-31
SouthAfricanCuisineCount_3m1 Count of South african recipes between 2021-10-01 and 2021-12-31
SouthAmericanCuisineCount_3m1 Count of South american recipes between 2021-10-01 and 2021-12-31
SouthAsiaCuisineCount_3m1 Count of South asia recipes between 2021-10-01 and 2021-12-31
SouthAsianCuisineCount_3m1 Count of South-asian recipes between 2021-10-01 and 2021-12-31
SouthEastAsianCuisineCount_3m1 Count of Southeast-asian recipes between 2021-10-01 and 2021-12-31
SouthHyphenAfricanCuisineCount_3m1 Count of South-african recipes between 2021-10-01 and 2021-12-31
SoutheastAsiaCuisineCount_3m1 Count of Southeast asia recipes between 2021-10-01 and 2021-12-31
SoutheastCuisineCount_3m1 Count of Southeast recipes between 2021-10-01 and 2021-12-31
SouthernEuropeCuisineCount_3m1 Count of Southern europe recipes between 2021-10-01 and 2021-12-31
SouthwestCuisineCount_3m1 Count of Southwest recipes between 2021-10-01 and 2021-12-31
SpanishCuisineCount_3m1 Count of Spanish recipes between 2021-10-01 and 2021-12-31
SrilankanCuisineCount_3m1 Count of Srilankan recipes between 2021-10-01 and 2021-12-31
SteakhouseCuisineCount_3m1 Count of Steakhouse recipes between 2021-10-01 and 2021-12-31
SwedishCuisineCount_3m1 Count of Swedish recipes between 2021-10-01 and 2021-12-31
ThaiCuisineCount_3m1 Count of Thai recipes between 2021-10-01 and 2021-12-31
TonganCuisineCount_3m1 Count of Tongan recipes between 2021-10-01 and 2021-12-31
TraditionalCuisineCount_3m1 Count of Traditional recipes between 2021-10-01 and 2021-12-31
TurkishCuisineCount_3m1 Count of Turkish recipes between 2021-10-01 and 2021-12-31
VietnameseCuisineCount_3m1 Count of Vietnamese recipes between 2021-10-01 and 2021-12-31
WestAfricanCuisineCount_3m1 Count of West african recipes between 2021-10-01 and 2021-12-31
WestHyphenAfricanCuisineCount_3m1 Count of West-african recipes between 2021-10-01 and 2021-12-31
WestafricaCuisineCount_3m1 Count of Westafrica recipes between 2021-10-01 and 2021-12-31
WesternEuropeCuisineCount_3m1 Count of Westerneurope recipes between 2021-10-01 and 2021-12-31
WesternEuropeanCuisineCount_3m1 Count of Western-european recipes between 2021-10-01 and 2021-12-31
ZanzibarianCuisineCount_3m1 Count of Zanzibarian recipes between 2021-10-01 and 2021-12-31

License

jstark is distributed under the terms of the MIT license.

Why "jstark"?

The name is phonetically similar to PySpark, is a homage to comic book character Jon Stark, and contains the initials of the original contributor (j, k & t).

About

feature generators based on pyspark

Topics

Resources

License

Contributing

Stars

Watchers

Forks

Packages

 
 
 

Contributors