Distributed-Storage-and-Parallel-Computing

This project uses HDFS and MapReduce (implemented in Java) to compute 20 Limit Order Book (LOB) quantitative factors in a distributed fashion from Shenzhen Stock Exchange Level-10 high-frequency snapshot data (3-second frequency). Key parameters are n=5 (order levels) and ∆t=1 (time interval), and 1e-7 is added to denominators to avoid division by zero. The core tasks are to compute the average factor value sequences for CSI 300 index stocks (training phase: data from Jan 2-8, 2024) and to output the same sequences for new data (testing phase). The work is evaluated on technical documentation, presentations, and code performance (1% error tolerance, speed ranking). Optimizations to CSV parsing, object reuse, memory control, key compression, the shuffle phase, concurrency tuning, and instruction-level details improve computational efficiency while keeping results within the error tolerance.
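
As a rough illustration of how the n=5 and 1e-7 parameters enter a factor formula, the sketch below computes a depth imbalance over the top five order-book levels. The factor itself, the class name `DepthImbalanceSketch`, and the array-based inputs are hypothetical and are not taken from the project's `Factor.java`; only the two parameter values come from the description above.

```java
/** Hypothetical example factor: depth imbalance over the top n order-book levels. */
final class DepthImbalanceSketch {
    static final int N_LEVELS = 5;       // n = 5 order levels, as stated above
    static final double EPSILON = 1e-7;  // added to the denominator, as stated above

    /**
     * Returns (bid volume - ask volume) / (bid volume + ask volume + EPSILON),
     * summing only the first N_LEVELS entries of each array.
     */
    static double depthImbalance(long[] bidVolumes, long[] askVolumes) {
        long bidSum = 0;
        long askSum = 0;
        for (int i = 0; i < N_LEVELS; i++) {
            bidSum += bidVolumes[i];
            askSum += askVolumes[i];
        }
        // EPSILON keeps the division well defined when both sides are empty.
        return (bidSum - askSum) / (bidSum + askSum + EPSILON);
    }

    public static void main(String[] args) {
        // Ten levels are supplied, but only the first five are used.
        long[] bids = {100, 200, 300, 400, 500, 600, 700, 800, 900, 1000};
        long[] asks = {100, 100, 100, 100, 100, 100, 100, 100, 100, 100};
        System.out.println(depthImbalance(bids, asks));
        System.out.println(depthImbalance(new long[10], new long[10]));
    }
}
```

With an empty book on both sides the numerator is zero, so the result is 0.0 rather than NaN, which is the point of the added constant.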

Code Details

FactorCombiner.java: A Hadoop Combiner class that merges Factor objects output by Mappers based on their keys, accumulating factor values and counts to reduce data transfer in the Reduce phase.
DailyOutputFormat.java: A custom Hadoop output format that writes results to CSV files corresponding to the date part (MMDD) in the key, including headers and specific data rows.
Factor.java: A factor class implementing the Writable interface. It holds 20 factor values and a count, and provides methods for calculation, merging, and serialization, so that factor data can be stored and processed.
FactorCalculationJob.java: A Hadoop job configuration class that sets various parameters of the MapReduce job (input/output formats, Mapper/Reducer classes, compression methods, memory configuration, etc.) and starts the job.
CompactTimeUtil.java: A time encoding utility class that compresses dates and times into 26-bit integers, supporting encoding, decoding, and time comparison to optimize storage and processing efficiency of time data.
SnapshotData.java: A snapshot data container class that stores transaction snapshot information (such as bid/ask prices and trading volumes) and provides methods for parsing it from byte arrays and copying data.
FastParser.java: A fast CSV parsing utility class that provides methods for parsing dates, long integers, strings, etc., with optimizations like loop unrolling to improve parsing efficiency.
RyuFloat.java: A floating-point to byte conversion utility class based on the Ryu algorithm, enabling efficient conversion of floating-point numbers to byte arrays with reduced memory allocation.
FactorMapper.java: A Hadoop Mapper class that parses input snapshot data, calculates factor values, holds them in a custom cache structure built for performance, and emits them to the Combiner/Reducer once a threshold is reached.
FactorReducer.java: A Hadoop Reducer class that performs final aggregation on factor data output by the Combiner, calculates averages, and formats them for output.
DayPartitioner.java: A Hadoop partitioner class that partitions data based on the date part in the compressed time to optimize data distribution in the Reduce phase.
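
The time encoding and day-based partitioning described above can be sketched as follows. The bit layout here (4 bits for month, 5 for day, 17 for second of day, 26 in total) is an assumption chosen only because it fits the stated 26-bit width; the actual layout in `CompactTimeUtil.java` may differ. The class name `CompactTimeSketch` and all method names are hypothetical, and the partition function is plain Java rather than a Hadoop `Partitioner` subclass.

```java
/** Hypothetical 26-bit time code: [month:4][day:5][secondOfDay:17]. */
final class CompactTimeSketch {
    private static final int SECONDS_BITS = 17; // 86,400 seconds per day < 2^17
    private static final int DAY_BITS = 5;      // days 1..31 < 2^5
    private static final int SECONDS_MASK = (1 << SECONDS_BITS) - 1;
    private static final int DAY_MASK = (1 << DAY_BITS) - 1;

    /** Packs a date and time of day into a single int. */
    static int encode(int month, int day, int hour, int minute, int second) {
        int secondOfDay = hour * 3600 + minute * 60 + second;
        return (month << (DAY_BITS + SECONDS_BITS)) | (day << SECONDS_BITS) | secondOfDay;
    }

    static int month(int code) { return code >>> (DAY_BITS + SECONDS_BITS); }

    static int day(int code) { return (code >>> SECONDS_BITS) & DAY_MASK; }

    static int secondOfDay(int code) { return code & SECONDS_MASK; }

    /** The date part only (month and day), with the time of day dropped. */
    static int dateKey(int code) { return code >>> SECONDS_BITS; }

    /** Sends all records of one trading day to the same partition. */
    static int partitionFor(int code, int numPartitions) {
        return dateKey(code) % numPartitions;
    }

    public static void main(String[] args) {
        int a = encode(1, 2, 9, 30, 3);
        int b = encode(1, 2, 9, 30, 6);
        System.out.println(month(a) + "/" + day(a) + " second " + secondOfDay(a));
        System.out.println(partitionFor(a, 5) == partitionFor(b, 5));
    }
}
```

Because the most significant fields come first, comparing two codes from the same year as plain integers gives chronological order, so time comparison needs no decoding under this layout.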
