Changes from all commits (20 commits)
- 217e9fc: merge deployment branch into local branch (KyleHaggin, Jan 30, 2020)
- 15b2f40: merge deployment branch with local branch (KyleHaggin, Jan 31, 2020)
- 294c7f9: Merge branch 'deployment' of https://github.com/Lambda-School-Labs/cr… (KyleHaggin, Feb 2, 2020)
- 2d031f5: Merge deployment branch to local branch (KyleHaggin, Feb 2, 2020)
- 7d50f38: pep8 conformity (KyleHaggin, Feb 3, 2020)
- ac1d570: cryptolytic.model.data_work up to pep8 conformity (KyleHaggin, Feb 3, 2020)
- e09b4b1: cryptolytic.model pep8 conformed. blank files deleted (KyleHaggin, Feb 3, 2020)
- bdf37a1: cryptolytic.data.__init conformed to pep8 standards (KyleHaggin, Feb 3, 2020)
- f71518c: cryptolytic.data.aws conformed to pep8 standards (KyleHaggin, Feb 3, 2020)
- 059fd92: cryptolytic.data.historical pep8 conformed (KyleHaggin, Feb 3, 2020)
- 4b630b0: cryptolytic.data.metrics pep8 conformed (KyleHaggin, Feb 3, 2020)
- 7899802: cryptolytic.data.spl pep8 conformed (KyleHaggin, Feb 3, 2020)
- 3e5266c: cryptolytic.data.utils pep8 conformed (KyleHaggin, Feb 3, 2020)
- df0dfb7: cryptolytic.util pep8 conformed (KyleHaggin, Feb 3, 2020)
- 2e8ef3e: cryptolytic.start pep8 conformed (KyleHaggin, Feb 3, 2020)
- c45fa9c: more commenting on cryptolytic.data.aws (KyleHaggin, Feb 3, 2020)
- 6845b6c: Updated README.md with new data source API documentations and locatio… (KyleHaggin, Feb 5, 2020)
- d01d79e: Updated predicitons and features sections of the README.md to reflect… (KyleHaggin, Feb 5, 2020)
- 56d2fda: Deleted unneeded whitespace (KyleHaggin, Feb 5, 2020)
- 96d6400: Deleted unneeded whitespace (KyleHaggin, Feb 5, 2020)
48 changes: 27 additions & 21 deletions README.md
@@ -43,35 +43,28 @@ Python, AWS, PostgreSQL, SQL, Flask

### Predictions

- The models folder contains two zip files, with a total of 30 models:
-
- tr_pickles.zip contains nine pickled trade recommender models.
-
- arb_models.zip contains 21 pickled arbitrage models.
-
- All 30 models use a RandomForestClassifier algorithm.
-
- Each trade recommender model recommends trades for a particular trading pair on a particular exchange by predicting whether the closing price will increase by enough to cover the costs of executing a trade.
-
- The arbitrage models predict arbitrage opportunities between two exchanges for a particular trading pair. Predictions are made ten minutes in advance. To count as an arbitrage opportunity, a price disparity between two exchanges must last for at least thirty minutes, and the disparity must be great enough to cover the costs of buying on one exchange and selling on the other.
+ The arbitrage models predict arbitrage opportunities between two exchanges for a particular trading pair. Predictions are made five minutes in advance. To count as an arbitrage opportunity, a price disparity between two exchanges must last for at least thirty minutes, and the disparity must be great enough to cover the costs of buying on one exchange and selling on the other.
+
+ The trained and pickled models can be accessed via the organization's AWS S3 bucket named "crypto-buckit", under the folder aws/models. Current code (as of 4 February 2020) will upload all future models into this S3 bucket.
+
+ The naming convention for the models is model\_{arbitrage/trade}\_{api}\_{trading\_pair}.pkl
+
+ The predictions themselves can be accessed via the organization's AWS RDS database with the table name "predictions".

### Features

- Each of the nine trade recommender models is trained on 67 features. Of those 67 features, five are taken directly from the OHLCV data (open, high, low, close, base_volume), one indicates where gaps were present in the data (nan_ohlcv), three indicate the time (year, month, day), and the remainder are technical analysis features.
+ Each of the nine trade recommender models is trained on 80 features. Of those 80 features, five are taken directly from the OHLCV data (open, high, low, close, base\_volume), and the remainder are technical analysis features. NaN values of open, high, low, and close are filled with the average market price, and NaN values of base\_volume are forward filled.

- Each of the 21 arbitrage models is trained on 91 features. Of those 91 features, three features indicate the time (year, month, day), and four indicate the degree and length of price disparities between two exchanges (higher_closing_price, pct_higher, arbitrage_opportunity, window_length). Half of the remaining 84 features are specific to the first of the two exchanges in a given arbitrage dataset and are labelled with the suffix "exchange_1"; the other half are specific to the second of those two exchanges and are labelled with the suffix "exchange_2". In each of these two sets of 42 features, two are taken directly from the OHLCV data (close_exchange_#, base_volume_exchange_#), one indicates where gaps were present in the data (nan_ohlcv), and the remainder are technical analysis features.
+ Each of the arbitrage models is trained on 80 features. Of those 80 features, four indicate the degree and length of price disparities between two exchanges (higher_closing_price, pct_higher, arbitrage_opportunity, window_length). Arbitrage is calculated by comparing the price of the primary exchange against the mean price of the other exchanges, which allows us to compare one market against every other market with minimal computation cost.
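As an illustrative sketch of that comparison (the exchange names, prices, and helper function here are made up, not the project's code):

```python
import pandas as pd

# Hypothetical closing prices for one trading pair across exchanges.
closes = pd.Series({"binance": 9150.0, "bitfinex": 9010.0,
                    "coinbase_pro": 9000.0, "hitbtc": 8990.0})

def pct_vs_other_markets(closes, primary):
    """Percent difference between the primary exchange's close and the
    mean close of every other exchange."""
    others_mean = closes.drop(primary).mean()
    return (closes[primary] - others_mean) / others_mean * 100
```

A single mean comparison per exchange avoids computing all pairwise exchange combinations, which is the "minimal computation cost" point made above.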

Technical analysis features were engineered with the Technical Analysis Library; they fall into five types:<br/>
(1) Momentum indicators<br/>
(2) Volume indicators<br/>
(3) Volatility indicators<br/>
(4) Trend indicators<br/>
(5) Other indicators<br/>
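The NaN handling described in the Features section (price gaps filled with the average market price, volume forward-filled) can be sketched with pandas; the frame and the `avg_price` column below are illustrative assumptions, not the project's schema:

```python
import numpy as np
import pandas as pd

# Hypothetical candle frame with a gap at row 1; `avg_price` stands in
# for the average market price used to fill the price columns.
df = pd.DataFrame({
    "open": [9000.0, np.nan, 9020.0],
    "high": [9010.0, np.nan, 9030.0],
    "low": [8990.0, np.nan, 9000.0],
    "close": [9005.0, np.nan, 9025.0],
    "base_volume": [12.0, np.nan, 7.0],
    "avg_price": [9001.0, 9003.0, 9015.0],
})

for col in ["open", "high", "low", "close"]:
    df[col] = df[col].fillna(df["avg_price"])   # price gaps -> average price
df["base_volume"] = df["base_volume"].ffill()   # volume gaps -> forward fill
```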

Documentation for the technical analysis features is available here:

@@ -81,14 +74,22 @@ Documentation for the technical analysis features is available here:

We obtained all of our data from the Cryptowatch, Bitfinex, Coinbase Pro, and HitBTC APIs. Documentation for obtaining that data is listed below:

- [Cryptowatch API OHLCV Data Documentation](https://developer.cryptowat.ch/reference/rest-api-markets#market-ohlc-candlesticks)
+ [Cryptowatch REST API OHLCV Data Documentation](https://docs.cryptowat.ch/rest-api/)

[Bitfinex API OHLCV Data Documentation](https://docs.bitfinex.com/reference#rest-public-candles)

[Coinbase Pro API OHLCV Data Documentation](https://docs.pro.coinbase.com/?r=1#get-historic-rates)

[HitBTC OHLCV Data Documentation](https://api.hitbtc.com/#candles)

[Binance API OHLCV Documentation](https://github.com/binance-exchange/binance-official-api-docs/blob/master/rest-api.md)

[Gemini REST API OHLCV Data Documentation](https://docs.gemini.com/rest-api/)

[Kraken REST API OHLCV Data Documentation](https://www.kraken.com/en-us/features/api)

[Poloniex API OHLCV Data Documentation](https://docs.poloniex.com/#introduction)

### Python Notebooks

[Notebook Folder](https://github.com/Lambda-School-Labs/cryptolytic-ds/tree/master/finalized_notebooks)
@@ -117,6 +118,11 @@ Returns: ``` {"results":"{
'prediction': 'result'}
]} ```

### Internal Access via AWS
The raw data and models can also be accessed internally through the organization's AWS accounts.
- AWS RDS: the RDS database holds historical candlestick data from the cryptocurrency market APIs, as well as the predictions from the models.
- AWS S3: the S3 bucket holds the pickled, trained models for both trading and arbitrage.
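A sketch of pulling one of the pickled models out of that bucket with boto3; the aws/models key layout and bucket name follow the README, but these helper names are illustrative, and a real call needs AWS credentials configured:

```python
def model_key(model_type, exchange_id, trading_pair, prefix="aws/models"):
    # Key layout assumed from the README's model naming convention:
    # model_{arbitrage/trade}_{api}_{trading_pair}.pkl
    return f"{prefix}/model_{model_type}_{exchange_id}_{trading_pair}.pkl"

def download_model(model_type, exchange_id, trading_pair,
                   bucket="crypto-buckit"):
    import boto3  # imported lazily; requires configured AWS credentials
    key = model_key(model_type, exchange_id, trading_pair)
    local_path = key.rsplit("/", 1)[-1]
    boto3.client("s3").download_file(bucket, key, local_path)
    return local_path
```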


## Contributing

80 changes: 46 additions & 34 deletions cryptolytic/data/__init__.py
@@ -27,9 +27,12 @@ def denoise(signal, repeat):


def resample_ohlcv(df, period=None):
"""
this function resamples ohlcv csvs for a specified candle interval;
while this can be used to change the candle interval for the data,
it can also be used to fill in gaps in the ohlcv data without changing
the candle interval
"""
# dictionary specifying which columns to use for resampling
ohlcv_dict = {'open': 'first',
'high': 'max',
@@ -38,7 +41,7 @@ def resample_ohlcv(df, period=None):
'volume': 'sum'}

# apply resampling
if period is None:
period = df['period'][0]
period = pd.to_timedelta(period, unit='s')
df_new = df.resample(period).agg(ohlcv_dict)  # .agg(): the resample(how=...) keyword was removed from pandas
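To illustrate what the resampling above does on a toy frame (not the project's data): resampling exposes a missing candle as a row of NaN price columns that can then be imputed. `.agg` is used here because the `how=` keyword no longer exists in modern pandas:

```python
import pandas as pd

# Two 5-minute candles with the 00:05 candle missing.
idx = pd.to_datetime(["2020-02-03 00:00", "2020-02-03 00:10"])
candles = pd.DataFrame({"open": [1.0, 3.0], "high": [2.0, 4.0],
                        "low": [0.5, 2.5], "close": [1.5, 3.5],
                        "volume": [10.0, 20.0]}, index=idx)

ohlcv_dict = {"open": "first", "high": "max",
              "low": "min", "close": "last", "volume": "sum"}
resampled = candles.resample("300s").agg(ohlcv_dict)
# The 00:05 gap now shows up as a row whose price columns are NaN.
```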
@@ -52,8 +55,9 @@ def nan_df(df):

def merge_candle_dfs(df1, df2):
"""Merge candle dataframes"""
merge_cols = ['trading_pair', 'exchange',
              'period', 'datetime', 'timestamp']
df_merged = df1.merge(df2, how='inner', on=merge_cols)
return df_merged


@@ -66,10 +70,14 @@ def outer_merge(df1, df2):


def fix_df(df):
"""
Changes columns to the right type if needed and makes sure the index is
set as the datetime of the timestamp. Maybe better to have pandas
infer numeric.
"""
df['datetime'] = pd.to_datetime(df['timestamp'], unit='s')
numeric = ['period', 'open', 'close', 'high', 'low', 'volume',
           'arb_diff', 'arb_signal']
for col in numeric:
if col not in df.columns:
continue
@@ -80,18 +88,19 @@ def fix_df(df):

def impute_df(df):
"""
Finds the gaps in the time series data for the dataframe, and pulls the
average market price and its last volume for those values and places those
values into the gaps. Any remaining gaps or new nan values are filled
with backwards fill.
"""
df = df.copy()
return df
# resample ohclv will reveal missing timestamps to impute
gapped = resample_ohlcv(df)
gaps = nan_df(gapped).index
# stop psycopg2 error with int conversion
convert_datetime = compose(int, convert_datetime_to_timestamp)
timestamps = mapl(convert_datetime, list(gaps))
info = {'trading_pair': df['trading_pair'][0],
'period': int(df['period'][0]),
'exchange': df['exchange'][0],
@@ -107,24 +116,24 @@ def impute_df(df):
df = fix_df(df)
df['volume'] = df['volume'].ffill()
df = df.bfill().ffill()
assert not df.isna().any().any()  # pandas returns numpy.bool_, so 'is False' would always fail
return df
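The final fill step above (forward-fill volume, then backward- and forward-fill whatever is left) can be seen in isolation on a toy frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"close": [np.nan, 101.0, np.nan, 103.0],
                   "volume": [5.0, np.nan, np.nan, 8.0]})
df["volume"] = df["volume"].ffill()  # last known volume carries forward
df = df.bfill().ffill()              # leading NaNs come from the next value
```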


def get_df(info, n=1000):
"""
Pull info from database and give it some useful augmentation for analysis.
TODO move functionality into get_data function in historical.
"""
df = sql.get_some_candles(info=info, n=n, verbose=True)
df = impute_df(df)

df['high_m_low'] = df['high'] - df['low']
df['close_m_open'] = df['close'] - df['open']
dfarb = sql.get_arb_info(info, n)

merged = merge_candle_dfs(df, dfarb)
assert not merged.isna().any().any()  # 'is False' would always fail on numpy.bool_
return merged


@@ -135,11 +144,11 @@ def thing(arg, axis=0):
return x, mu, std


# Version 2
def normalize(A):
if isinstance(A, pd.DataFrame) or isinstance(A, pd.Series):
A = A.values
if np.ndim(A) == 1:
A = np.expand_dims(A, axis=1)
A = A.copy()
x, mu, std = thing(A, axis=0)
@@ -149,22 +158,24 @@ def normalize(A):
# from sql
A[:, i] = (x[:, i] - mu[i]) / std[i]
return A


def denormalize(values, df, col=None):
"""
Denormalize, needs the original information to be able to denormalize.
"""
values = values.copy()

def eq(x, mu, std):
return np.exp((x * std) + mu) - 1

if np.ndim(values) == 1 and col is not None:
x, mu, std = thing(df[col])
return eq(values, mu, std)
else:
for i in range(values.shape[1]):
x, mu, std = thing(df.iloc[:, i])
if isinstance(values, pd.DataFrame):
values.iloc[:, i] = eq(values.iloc[:, i], mu, std)
else:
values[:, i] = eq(values[:, i], mu, std)
@@ -177,29 +188,30 @@ def windowed(df, target, batch_size, history_size, step, lahead=1, ratio=0.8):
"""
xs = []
ys = []

x = df
y = df[:, target]

start = history_size  # 1000
end = df.shape[0] - lahead  # 4990
# 4990 - 1000 = 3990
for i in range(start, end):
indices = range(i-history_size, i, step)
xs.append(x[indices])
ys.append(y[i:i+lahead])

xs = np.array(xs)
ys = np.array(ys)

nrows = xs.shape[0]
train_size = int(nrows * ratio)
# make sure the sizes are multiples of the batch size
# (needed for some types of models)
train_size -= train_size % batch_size
val_size = nrows - train_size
val_size -= val_size % batch_size
total_size = train_size + val_size
xs = xs[:total_size]
ys = ys[:total_size]

return xs[:train_size], ys[:train_size], xs[train_size:], ys[train_size:]
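A miniature of the windowing loop above, with hypothetical sizes (history_size=3, step=1, lahead=1) on a single-column array:

```python
import numpy as np

series = np.arange(10, dtype=float).reshape(-1, 1)
history_size, step, lahead, target = 3, 1, 1, 0

xs, ys = [], []
for i in range(history_size, series.shape[0] - lahead):
    indices = range(i - history_size, i, step)
    xs.append(series[indices])                  # trailing window of inputs
    ys.append(series[:, target][i:i + lahead])  # value(s) to predict
xs, ys = np.array(xs), np.array(ys)
# xs[0] holds the window [0, 1, 2] and ys[0] holds the next value, [3].
```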
6 changes: 5 additions & 1 deletion cryptolytic/data/aws.py
@@ -32,4 +32,8 @@ def get_path(folder_name, model_type, exchange_id, trading_pair, ext):
aws_folder = os.path.join('aws', folder_name)
if not os.path.exists(aws_folder):
os.mkdir(aws_folder)
return os.path.join(
    aws_folder, f'model_{model_type}_{exchange_id}_{trading_pair}{ext}'
    # Windows operating systems use \\ instead of /; the replace call
    # is required to conform with Unix-style paths
).replace('\\', '/')
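An alternative to the manual `.replace('\\', '/')` is `pathlib`, which understands both separators; this is a sketch of the design choice, not the repo's code:

```python
from pathlib import PureWindowsPath

def to_posix(path_str):
    # PureWindowsPath accepts both '\\' and '/' as separators, and
    # as_posix() always emits forward slashes.
    return PureWindowsPath(path_str).as_posix()
```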