diff --git a/.gitignore b/.gitignore
index ef7be74..d5ae3a7 100644
--- a/.gitignore
+++ b/.gitignore
@@ -211,7 +211,7 @@ dmypy.json
### VisualStudioCode ###
.vscode/*
-!.vscode/settings.json
+.vscode/settings.json
!.vscode/tasks.json
!.vscode/launch.json
!.vscode/extensions.json
@@ -229,8 +229,11 @@ model
train/runs
train/output*
train/models
-
+train/ru_gpt2
# prep files
data/prep
# jokes database
-jokes.db
\ No newline at end of file
+*.db
+!jokes.db
+# log files
+*.log
\ No newline at end of file
diff --git a/README.md b/README.md
index 2fa3115..ad34c10 100644
--- a/README.md
+++ b/README.md
@@ -1,254 +1,109 @@
-# Modern Application Production
-
-Language: Python
-
-Version Control: GitHub
-
-Tests: Python unittest package
-
-Tasks: [Trello](https://https://trello.com/invite/b/6k7nigbp/8d26ddad39c33393cf6053e7ac0ac4ac/pmldl)
-
-User Stories: [GitHub issues](https://github.com/FrogTravel/PMLDL/issues)
-
-We are using feature-oriented branches
-
-Team: Irina Podlipnova, Boris Guryev, Ekaterina Levchenko (Scram Master), Vladislav Kuleykin
-
-## Project Proposal
-
-### General
-
-These days jokes and memes become part of modern people's life. Each day everybody spends time consuming content. Communities like *Reddit* or *9GAG* periodically release meme calendars.
-
-We propose a question-answer joke generator telegram bot with the aim to distinguish if artificially generated content can entertain users at the same level as human-created jokes.
-
-### Training
-For training, we will apply transfer learning. As a starting model we will try different state-of-the-art language models, such as *BERT* [2], *GPT-2* [3]. As a training dataset will use set from *Kaggle* [1], which we will extend with other 2-liner jokes from the internet.
-
-### Evaluation
-Joking is a subjective topic, so we are planning to ask people to evaluate samples of generated jokes. And afterward, we'll compare the results for different models.
-
-### Expected Final Result
-We are planning to deploy this project into telegram infrastructure for simple interaction with the user. The flexibility of this platform covers all requirements described above such as: deliver content, collecting evaluation from users and easy deployment.
-
-1) [ Kaggle dataset](https://www.kaggle.com/jiriroz/qa-jokes )
-2) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
-Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
-3) Language Models are Unsupervised Multitask Learners. Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever
-
-## Sprint 1
-### Burndown chart
-
-storypoints on vertical x days on horizontal axis
-Blue - ideal burndown, Green - our result
-
-## Sprint 2 & 3
-### Burndown charts
-#### Sprint 2
-
-storypoints on vertical x days on horizontal axis
-Blue - ideal burndown, Green - our result
-#### Sprint 3
-
-storypoints on vertical x days on horizontal axis
-Blue - ideal burndown, Green - our result
-
-Below are reports that we made for our course assignments. They might be different from the real tasks in trello in some way
-
-### Dataset
-The dataset we going to use is taken from Kaggle competition [4]. It contains more than 38000 samples of so-called question-answer jokes. All jokes are parsed from Reddit.
-
-
-Jokes examples:
-
- Q: Why is Yoda the worst copilot?
- A: "Yoda, are we still going the right way?"
- "Off course we are"
-
- Q: What do you call a blonde who dyes her hair brown?
- A: AI (Artificial Intelligence)
-
-
- Q: Why do programmers confuse Christmas and Halloween?
- A: Because Dec 25 is Oct 31
-
-Also, we'd like to extend Kaggle's dataset with some more examples because we still think that the number of jokes in the dataset is not enough to train a good model. Moreover, some jokes are duplicated, so it makes the model to "overfit" and generate a lot of similar jokes. There can be at least 3 possible ways to solve this problem: add more examples, so to neglect the influence of duplicates on the model; exclude duplicates from the dataset using some measurement of similarity or just ignore it assuming that this will not lead to overfitting.
-
-Finally, we need to clear the dataset and make some pre-processing because jokes are just parsed from Reddit and not cleaned up. So there can be some notes inside jokes, explanations, editions marks, some credits to authors of jokes, jokes can be in the wrong format and etc. Below some examples of such cases:
-
-Initial version:
-
- Q: How do Muslims laugh?
- A: Muahahahamed Note: I don't have any prejudices against Islamic
-
-Pre-processed version:
-
- Q: How do Muslims laugh?
- A: Muahahahamed
-
-Initial version:
-
- Q: Why did Eric Clapton make the switch from PC to Apple?
- A: Well because he had a horrible experience with windows. (credit to Neil Hamburger for this amazing joke)
-
-Pre-processed version:
-
- Q: Why did Eric Clapton make the switch from PC to Apple?
- A: Well because he had a horrible experience with windows.
-
-Initial version:
-
- Q: What do you have left after you burn a French alphabet?
- A: H Edit: I don't like explaining jokes but since the first guy didn't get I might as well: When pronounced in a French accent it sounds like ash.
-
-Pre-processed version:
-
- Q: What do you have left after you burn a French alphabet?
- A: H
-Initial version:
-
- Q: Why does Santa have three gardens?
- A: Q: Why does Santa have three gardens? A: So he can ""hoe, hoe, hoe.""
-
-Pre-processed version:
-
- Q: Why does Santa have three gardens?
- A: So he can ""hoe, hoe, hoe.""
-
-
-### Telegram bot
-#### Reasons
-We decided to use a Telegram bot because it is a very lightweight solution to communicate with a user and retrieve feedback from them. The flexibility of the bot API allows us to implement all planned features and possible increments in the future.
-
-#### Functional description
-Our primary feature is the question-answer jokes generation when the joke sends by a user and funny candidate answer returned by bot. Each generated joke will include two inline buttons: "Thumb-up" and "Thumb-down". Then we can collect accepted by humans jokes to pool and use them in future: a) for fine-tuning; b) for sharing funny joke responses. There are two ways of doing so - directly in a bot or in a separate channel.
-
-
-
-### Language models review
-#### BERT
-Still a famous and widely used model. However, initially this model natural to use for tasks like classification, named entity detection or casual embedding [1]. However, there are several improvements on BERT adopted to model language [2] or trained/fine-tuned for modeling language
-
-For now, we took the pre-trained model from [2] and produced several samples for text generation with non-empty (provided by us) seed sentences:
-
- Q: who touched my pasta ?
- A: bad people , like me . like pretty girl .
- but so far i was a little hungry .
-
- Q: who touched my spaghetti ?
- A: nobody had never touched my hand ,
- pleasant bo ##le , who had not played any such game yet ?
-
-So, even pre-trained for sentence generation, this model still can return some meaningful in context respond as well as fluctuated sequence. Still, we will try to fine-tune the model with our data to evaluate final behavior.
-
-There are many BERT modifications and one of them could be applicable for our task. We found the Question-Answer dataset [3] with links to implementation papers. Probably one of their models or training approaches will be significant for our work.
-
-
-
-### GPT-2
-#### Model selection
-GPT-2 is the current state-of-the-art generative transformer model. It's a direct scale-up of their previous GPT model, the only change is the number of parameters ($>$10x more), and new, cleaner dataset (also $>$10x bigger). GPT-2 achieves state-of-the-art scores on a variety of domain-specific language modeling tasks without any fine-tuning.
-
-Also, GPT-2 has models of different sizes\footnote{We use the names from the https://huggingface.co/transformers/v2.3.0/pretrained_models.html transformers framework documentation, and as we have multiple limitations:
-
-- *processing power* - current consumer GPUs can't even fit \texttt{gpt2-large} and \texttt{gpt2-xl} models for training, so our focus will be on a smaller versions;
-- *small dataset* - as our dataset is quite small (3.6 MB) compared to original (40 GB), we'll not be able to fine-tune big models, but for the small models, this is a right amount of data, as they fine-tune fast;
-- *inference time* - and as we plan to deploy the final model, we need to think about inference time, as it needs to be fast, for the bot to provide in-time responses.
-
-
-So, for now we'll consider gpt-2, gpt-2-medium and DistilGPT2 models.
-
-### Fine-tuning results
-For now we tried to fine-tune the gpt-2 and gpt-2-medium models, and their results were not distinct from each other, so I'll cover them jointly.
-
-Firstly, disclaimer: there's a lot of filthy words, as the dataset is full of them.
-
-The model generates a lot of classic cross the road, change the bulb jokes, which again can come from a bad jokes balance in the dataset:
-
- Q: Why did the monkey cross the road?
- A: For his brother's sake.
-
- Q: How many babies does it takes to screw in a lightbulb?
- A: About 4,000.
-
- Q: How many feminists does it take to change a lightbulb?
- A: None. They can't change anything.
-
-Most of the other jokes make no sense:
-
- Q: What is Gordon Ramsay's favorite beverage?
- A: Mountain Bison
-
- Q: Why is it so windy in Russia?
- A: Because everything is worth it.
-
- Q: What do Jesus and the world have in common?
- A: They are all tied up by a knot of knotches.
-
- Q: Why do Chinese men laugh when circumcised?
- A: This is how they greet the penis, not theirs.
-
- Q: Who is Mario?
- A: More of a dictator who can watch video games while hiding in secrets.
-
-And sometimes...
-
-
- Q: What does the man and an egg sandwich have in common?
- A: They both have eggs.
-
- Q: What do you call a person that has a fetish for cheese?
- A: A cheetah.
-
- Q: What's the most popular food at a gay barbecue?
- A: KFC.
-
- Q: Where do poor people live?
- A: In India.
-
-### References
-1) https://github.com/huggingface/transformers/issues/401 - Transformers package "How can I generate new text after having fine-tuned BERT on a custom dataset?" issue discussion.
-2) https://arxiv.org/pdf/1902.04094.pdf - BERT has a Mouth, and It Must Speak: BERT as a Markov Random Field Language Model
-3) https://rajpurkar.github.io/SQuAD-explorer/ - SQuAD - The Stanford Question Answering Dataset
-4) https://www.kaggle.com/jiriroz/qa-jokes - Kaggle. Question-Answer Jokes
-5) https://github.com/FrogTravel/PMLDL - Our Github repository
-
-
-## Sprint 4
-### Burndown chart
-
-storypoints on vertical x days on horizontal axis
-Blue - ideal burndown, Green - our result
-### Overview
-This week we focused on the prototype development. We contributed in the Telegram bot, deployment part and improvement of the fine-tuning of the GPT-2.
-
-### Telegram bot
-During this sprint, we implemented the last part for our bot interface. Our bot has two use cases. The first one is the /joke command, which generates the joke as a pair "Joke-Answer." Another case only the "answer" generation. To use this function, send a question to our bot, it will concern any text without backslash as a question for a generation. Though we will change this in the next spring to reduce the load on the server.
-After generation, the bot will display two buttons - thumb up and thumb down. The user might grade the joke. After that, the "grading" message will be changed to "Thank you for your feedback."
-
-### Deployment
-First of all, we reviewed open source solutions for wrapping models into web API.
-
-*Cortex* [2] - an open-source platform for deploying machine learning models as production web services. Provides logging, auto-scaling, forwarding GPU into the container, etc. However, this framework is designed to be self-hosted on AWS infrastructure. During this project, we aimed to run anything locally, because of the hardware requirements of the selected language model.
-
-*Model Asset Exchange (MAX)* [3] - this is a project template wrapped around flask python backend framework. We found it over architected because the main goal of this project is to force developers to provide API documentation to the endpoints.
-
-Finally, we decided to rewrite the CLI of the transformers repository [1] in simple format convenient for our task. In our case, it is enough to keep the model in memory - usual python class instance - and run forward whenever bot receives a message from the user. But, as requests can be processed in an asynchronous manner (especially in time of presentation), we found an issue of GPU memory leak when more than 1 requests processed.
-
-### Training
-For the training part, this week we worked on improving the fine-tuning of the GPT-2 model:
-
-- *Fixed preprocessing* - we had a few errors as we forgot to put special tokens for the start and end of the documents, so now they are fixed.
-- *Experimented with different frameworks* - we experimented with a couple of TensorFlow/Pytorch frameworks for fine-tuning the GPT-2, transformers [1] and gpt-2-simple [5]. And now are settled at the transformers, with the planning of some adoption of functionality from the other one.
-- *Collected new dataset for fine-tuning* - as our previous results were repetitive and not diverse, as we think it is because of the small size of the dataset (3 MB), so we proposed an additional dataset of stand up transcripts (13 MB), which we gathered ourselves. We haven't yet tested our proposal.
-
-### References
-1) https://github.com/huggingface/transformers/blob/master/examples/run_language_modeling.py - Transformers. Run language model example
-2) https://github.com/cortexlabs/cortex/ Cortex. Cloud native model serving infrastructure
-3) https://developer.ibm.com/tutorials/getting-started-with-the-ibm-code-model-asset-exchange/ IBM Model Asset Exchange
-4) https://github.com/FrogTravel/PMLDL}{Our Github repository
-5) https://github.com/minimaxir/gpt-2-simple}{\texttt{gpt-2-simple GitHub page
-
-## Sprint 5
-
-We are now at Sprint 5. We are connecting our bot interface to a backend and making tests for a bot. We fine-tune the final model at the backend.
+# Joke Generator Bot
+A question-answer joke generator bot for Telegram. The jokes are generated by a fine-tuned GPT-2 model.
+
+**Authors**:
+* Vlad Kuleykin
+* Boris Guryev
+* Ekaterina Levchenko
+* Irina Podlipnova
+
+## Purpose
+This project was done during the *Practical Machine Learning and Deep Learning* course in the Spring 2020 semester at *Innopolis University* (Innopolis, Russia). You can find our technical report in the `report.pdf` file.
+
+## How to Use
+Set the bot token and the path to the model in `bot.cfg`, then run:
+```cmd
+python3 bot/main_bot.py
+```
+And you have a working bot!
+
+All dispatched jokes will be stored in `jokes.db`, with the feedback in the `vote` table.
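The stored feedback can be aggregated straight from this database. A minimal sketch, assuming a schema along these lines (the actual table and column names live in `bot/storage.py`, so check them before relying on this):

```python
import sqlite3

# Build a toy database with an *assumed* schema (joke + vote tables);
# point the connection at "jokes.db" to query the real bot's data.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE joke (id INTEGER PRIMARY KEY, text TEXT, generated_by TEXT);
    CREATE TABLE vote (joke_id INTEGER, user_id INTEGER, rating INTEGER);
    INSERT INTO joke VALUES (1, 'Q... A...', 'model'),
                            (2, 'Q... A...', 'qa_jokes.csv');
    INSERT INTO vote VALUES (1, 10, 1), (1, 11, 1), (2, 10, -1);
""")

# Average rating per joke source (+1 = thumbs-up, -1 = thumbs-down).
rows = con.execute("""
    SELECT j.generated_by, AVG(v.rating)
    FROM joke j JOIN vote v ON v.joke_id = j.id
    GROUP BY j.generated_by
""").fetchall()
print(dict(rows))
```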
+
+Also, you can see the run log in stdout and in the `run.log` file:
+
+```log
+2020-05-11 14:40:23,546 - DS: qa_jokes.csv - INFO - Got joke from dataset
+2020-05-11 14:40:25,613 - ML: output_one - INFO - Got joke from the buffer
+2020-05-11 14:40:27,483 - DS: shqa_jokes.csv - INFO - Got joke from dataset
+2020-05-11 14:40:29,205 - ML: output_8 - INFO - Filling the buffer
+2020-05-11 14:40:29,205 - ML: output_8 - INFO - Got joke from the buffer
+```
+Where `DS` stands for the dataset and `ML` stands for the model.
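These logger names come straight from the code (`logging.getLogger('DS: ' + name)` for datasets, `'ML: ' + name` for models), and the line layout is the standard `logging` formatter pattern. A minimal sketch that reproduces one such line:

```python
import io
import logging

# Reproduce the "<time> - <name> - <level> - <message>" lines from run.log.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter(
    "%(asctime)s - %(name)s - %(levelname)s - %(message)s"))

logger = logging.getLogger("DS: qa_jokes.csv")  # same naming as in bot/data.py
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("Got joke from dataset")

line = stream.getvalue().strip()
print(line)
```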
+
+
+### A/B Testing
+To evaluate the quality of our bot, we added the ability to test the models against jokes from the datasets.
+
+To do this, set the `ab_test` flag in the configuration file to `true` and add the desired model and dataset paths to the corresponding fields, using `,` as the delimiter.
+
+The bot will randomly choose the source for the next joke and will record the model/dataset path in the `generated_by` column of the database.
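An A/B configuration in `bot.cfg` could then look like this (the field names match the config shipped in this repository; the second model path and dataset path are placeholders for your own):

```ini
token = ...
ab_test = true
[model]
model_paths = model,train/ru_gpt2
dataset_paths = data/qa_jokes.csv,data/shqa_jokes.csv
max_joke_len = 40
buffer_size = 16
```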
+
+### Russian model (can be any other language)
+We added the ability to wrap the original model with a model for another language.
+Currently, if you provide the `rus_model_path` field in the configuration file, the bot will wrap the original model with the provided one. If it finds Cyrillic text in the prompt, or the command `\шутка`, it will use the Russian model to generate the answer.
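The routing decision itself is a simple check. A minimal sketch of the idea, with a hypothetical function name (the wrapper class in the bot is structured differently):

```python
import re

# Any Cyrillic letter in the prompt routes it to the Russian model.
CYRILLIC = re.compile('[а-яА-ЯёЁ]')

def pick_model(prompt, en_model, ru_model):
    """Route the prompt to the Russian model if it contains Cyrillic
    text or starts with the \шутка command, else to the original one."""
    if prompt.strip().startswith('\\шутка') or CYRILLIC.search(prompt):
        return ru_model
    return en_model

print(pick_model('Why did the chicken cross the road?', 'en', 'ru'))
print(pick_model('почему курица перешла дорогу?', 'en', 'ru'))
```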
+
+## How to train a model
+To train the model, we have three datasets:
+* **QA Jokes** (3.29 MB) - the original dataset we found on [Kaggle][1]. It contains ~38k question-answer jokes
+* **Short Jokes** (22.8 MB) - the biggest dataset in our collection; it was also found on [Kaggle][2] and consists of ~231k short jokes from Twitter and Reddit, but it also contains a lot of noise and misspellings
+* **Stand up transcripts** (13.5 MB) - a manually scraped dataset of stand-up transcripts from a single [site][3]
+
+But you're free to use others! (and please write us if you find a good one)
+
+For training, the `GPT-2 train helper.ipynb` notebook in the `train` folder can come in handy: it converts the datasets to the GPT-2 input format and extracts the QA jokes from the *Short Jokes* dataset.
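For reference, the input format follows the tokens the bot itself uses (`[QUESTION]`, `[ANSWER]`, `<|endoftext|>`). A minimal sketch of the conversion with a hypothetical helper name; the notebook's actual code may differ:

```python
import csv
import io

def to_gpt2_file(csv_text):
    """Convert a QA-jokes CSV (Question/Answer columns, as in the
    Kaggle dataset) into one training document per joke:
    [QUESTION]<q>\n[ANSWER] <a><|endoftext|>"""
    docs = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        docs.append('[QUESTION]' + row['Question'].strip()
                    + '\n[ANSWER] ' + row['Answer'].strip()
                    + '<|endoftext|>')
    return '\n'.join(docs)

sample = 'ID,Question,Answer\n1,Why?,Because.\n'
print(to_gpt2_file(sample))
```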
+
+For the actual training, use the `run_lm_finetuning.py` script, merged from two scripts from the [Transformers library][4] and the [ru_transformers repo][6]:
+```cmd
+python3 run_lm_finetuning.py \
+ --model_type=gpt2 \
+ --model_name_or_path=**INPUT_MODEL_PATH** \
+ --output_dir=**OUTPUT_MODEL_PATH** \
+ --learning_rate=1e-05 \
+ --num_train_epochs=10 \
+ --per_gpu_train_batch_size=2 \
+ --gradient_accumulation_steps=8 \
+ --logging_steps=100 \
+ --save_steps=1000 \
+ --unfreeze_level 0 \
+ --train_data_file=**TRAIN_DATA_PATH** \
+    --do_train
+```
+If the model doesn't fit in your GPU, try lowering `block_size` or `per_gpu_train_batch_size`.
+For tips on how to train the model, see the [ru_transformers repo][6].
+
+To test the current model, run the `run_generation.py` script:
+```cmd
+python3 run_generation.py \
+ --model_type=gpt2 \
+ --model_name_or_path=**TRAINED_MODEL_PATH** \
+ --prompt="[QUESTION]" \
+ --length=60 \
+ --stop_token="<|endoftext|>" \
+ --temperature=0.9 \
+ --repetition_penalty=1.05 \
+ --k=50 \
+ --p=0.95 \
+ --num_return_sequences=40
+```
+Feel free to experiment with `temperature`, `k`, `p` and `repetition_penalty`; to better understand what these arguments do, see [this guide][5].
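For intuition, the three sampling knobs interact like this: `temperature` reshapes the distribution, then `k` and `p` cut away its tail. A toy sketch over a hand-made distribution (not the transformers implementation):

```python
import math

def filter_logits(logits, temperature=0.9, k=50, p=0.95):
    """Apply temperature to the logits, keep the top-k tokens, then
    keep the smallest set whose cumulative probability reaches p
    (nucleus sampling). Returns the indices of surviving tokens."""
    probs = [math.exp(l / temperature) for l in logits]
    total = sum(probs)
    probs = [q / total for q in probs]
    ranked = sorted(range(len(probs)), key=lambda i: -probs[i])[:k]
    kept, cum = [], 0.0
    for i in ranked:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    return sorted(kept)

# A lower temperature sharpens the distribution, so fewer
# tokens survive the nucleus cut.
print(filter_logits([5.0, 4.0, 1.0, 0.5], temperature=0.5, p=0.95))  # → [0, 1]
print(filter_logits([5.0, 4.0, 1.0, 0.5], temperature=2.0, p=0.95))
```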
+
+### Russian model
+To train the Russian model, please refer to the instructions in the [ru_transformers repo][6]. We only fixed some errors in the scripts provided there, so if you want to train using our script, pass `--model_type=gpt2-yttm` instead of the `--tokenizer_class YTEncoder` flag. Everything else stays the same.
+
+
+[1]: https://www.kaggle.com/jiriroz/qa-jokes "QA Jokes dataset"
+
+[2]: https://www.kaggle.com/abhinavmoudgil95/short-jokes "Short Jokes dataset"
+
+[3]: https://scrapsfromtheloft.com "Stand Up transcripts site"
+
+[4]: https://github.com/huggingface/transformers/blob/master/examples/run_language_modeling.py "Transformers. Run language model example"
+
+[5]: https://huggingface.co/blog/how-to-generate "Hugging face. How to generate"
+
+[6]: https://github.com/mgrankin/ru_transformers "Russian GPT-2"
diff --git a/bot/bot.cfg b/bot/bot.cfg
index c6e15fd..ccb88e3 100644
--- a/bot/bot.cfg
+++ b/bot/bot.cfg
@@ -3,6 +3,7 @@ token = ...
ab_test = false
[model]
model_paths = model
+rus_model_path = rus_model
dataset_paths = data/qa_jokes.csv
max_joke_len = 40
buffer_size = 16
diff --git a/bot/data.py b/bot/data.py
new file mode 100644
index 0000000..99317e2
--- /dev/null
+++ b/bot/data.py
@@ -0,0 +1,38 @@
+import os
+import logging
+import pandas as pd
+
+
+class Dataset:
+ """Wrapper for the DataFrame to return values similar
+ to `AbstractJokeGenerator` output.
+ """
+
+ def __init__(self, dataset_path, promt_token, answer_token):
+ self.promt_token = promt_token
+ self.answer_token = answer_token
+ self.name = os.path.split(dataset_path)[1]
+ self.data = pd.read_csv(dataset_path)
+ self.logger = logging.getLogger("DS: " + self.name)
+
+ def __getitem__(self, idx):
+ question = self.data['Question'].iloc[idx].strip()
+ answer = self.data['Answer'].iloc[idx].strip()
+ text = (self.promt_token + question + '\n'
+ + self.answer_token + ' ' + answer)
+ self.logger.info('Got joke from dataset')
+ return {
+ 'text': text,
+ 'generated_by': self.name,
+ }
+
+ def __len__(self):
+ return len(self.data)
+
+
+class Joke:
+ """An interface class for a Joke."""
+
+ def __init__(self, text, id):
+ self.id = id
+ self.text = text
diff --git a/bot/inference.py b/bot/inference.py
index 67b5150..6044982 100644
--- a/bot/inference.py
+++ b/bot/inference.py
@@ -1,4 +1,6 @@
import torch
+from torch import LongTensor
+import logging
from transformers import (
CTRLLMHeadModel,
@@ -14,9 +16,11 @@
XLNetLMHeadModel,
XLNetTokenizer,
)
+from yt_encoder import YTEncoder
MODEL_CLASSES = {
"gpt2": (GPT2LMHeadModel, GPT2Tokenizer),
+ "gpt2-yttm": (GPT2LMHeadModel, YTEncoder),
"ctrl": (CTRLLMHeadModel, CTRLTokenizer),
"openai-gpt": (OpenAIGPTLMHeadModel, OpenAIGPTTokenizer),
"xlnet": (XLNetLMHeadModel, XLNetTokenizer),
@@ -24,38 +28,45 @@
"xlm": (XLMWithLMHeadModel, XLMTokenizer),
}
+# Don't show warnings.
+logging.getLogger("transformers.tokenization_utils").setLevel(logging.ERROR)
+logging.getLogger("transformers.modeling_utils").setLevel(logging.ERROR)
+logging.getLogger("transformers.configuration_utils").setLevel(logging.ERROR)
+
class ModelWrapper:
def __init__(self, model_path, model_name,
device='cpu',
model_type='gpt2',
- max_length=40,
+ max_len=40,
temperature=0.9,
- num_return_sequences=1,
+ n_return_sequences=1,
repetition_penalty=1.0,
k=50,
- p=0.95,):
- self.num_return_sequences = num_return_sequences
+ p=0.95,
+ **kwargs):
+ self.n_return_sequences = n_return_sequences
self.repetition_penalty = repetition_penalty
self.k = k
self.p = p
self.temperature = temperature
- self.device = device = torch.device(device)
- self.max_length = max_length
+ self.device = torch.device(device)
+ self.max_length = max_len
model_class, tokenizer_class = MODEL_CLASSES[model_type]
self.tokenizer = tokenizer_class.from_pretrained(model_path)
self.model = model_class.from_pretrained(model_path)
self.name = model_name
- self.model.to(device)
+ self.logger = logging.getLogger('ML: ' + self.name)
+ self.model.to(self.device)
def __encode(self, text):
- encoded_prompt = self.tokenizer.encode(
- text, add_special_tokens=False, return_tensors="pt")
+ encoded_prompt = self.tokenizer.encode(text, add_special_tokens=False)
+ encoded_prompt = LongTensor(encoded_prompt).unsqueeze(0)
return encoded_prompt.to(self.device)
- def generate(self, beginning, num_return_sequences=None):
- if num_return_sequences is None:
- num_return_sequences = self.num_return_sequences
+ def generate(self, beginning, n_return_sequences=None):
+ if n_return_sequences is None:
+ n_return_sequences = self.n_return_sequences
encoded_prompt = self.__encode(beginning)
output_sequences = self.model.generate(
input_ids=encoded_prompt,
@@ -65,7 +76,7 @@ def generate(self, beginning, num_return_sequences=None):
top_p=self.p,
repetition_penalty=self.repetition_penalty,
do_sample=True,
- num_return_sequences=num_return_sequences,
+ num_return_sequences=n_return_sequences,
)
# Remove the batch dimension when returning multiple sequences
@@ -86,7 +97,7 @@ def generate(self, beginning, num_return_sequences=None):
# dt = datetime.datetime.now() - start
# print("\t", dt)
# Batch processing test
- res = m.generate("[QUESTION] ", num_return_sequences=4)
+ res = m.generate("[QUESTION] ", n_return_sequences=4)
for j in res:
print(j)
diff --git a/bot/joke.py b/bot/joke.py
deleted file mode 100644
index 91eee0d..0000000
--- a/bot/joke.py
+++ /dev/null
@@ -1,4 +0,0 @@
-class Joke:
- def __init__(self, text, id):
- self.id = id
- self.text = text
diff --git a/bot/joke_generator.py b/bot/joke_generator.py
index d51347f..7825bb2 100644
--- a/bot/joke_generator.py
+++ b/bot/joke_generator.py
@@ -1,18 +1,17 @@
-import random
import os
-import threading
import re
-from itertools import cycle
+import random
+import threading
+import logging
+from abc import ABC, abstractmethod
-import pandas as pd
import storage
from inference import ModelWrapper
-from joke import Joke
-
-from abc import ABC, abstractmethod
+from data import Dataset, Joke
def synchronized(func):
+ """Decorator for the syncronized usage of the function."""
func.__lock__ = threading.Lock()
def synced_func(*args, **kws):
@@ -21,19 +20,28 @@ def synced_func(*args, **kws):
return synced_func
+
class AbstractJokeGenerator(ABC):
"""Abstract class for joke generation using `ModelWrapper`."""
- default_promt_token = '[QUESTION]'
- answer_token = '[ANSWER]'
- custom_promt = f'{default_promt_token}{{}}\n{answer_token}'
- stop_token = '<|endoftext|>'
POS_GRADE = 1
NEG_GRADE = -1
- def __init__(self, jokes_buffer_size):
+ def_config = {
+ 'buffer_size': 16,
+ 'max_len': 50,
+ 'device': 'cpu',
+ 'promt_token': '[QUESTION]',
+ 'answer_token': '[ANSWER]',
+ 'custom_promt': '[QUESTION]{}\n[ANSWER]',
+ 'stop_token': '<|endoftext|>',
+ }
+
+ def __init__(self, config=None):
self.store = storage
- self.jokes_buffer_size = jokes_buffer_size
+ self.config = AbstractJokeGenerator.def_config.copy()
+ if config:
+ self.config.update(config)
def positive_grade(self, user_id, joke_id):
self.store.add_or_update_vote(
@@ -42,17 +50,26 @@ def positive_grade(self, user_id, joke_id):
def negative_grade(self, user_id, joke_id):
self.store.add_or_update_vote(
joke_id=joke_id, user_id=user_id, rating=self.NEG_GRADE)
-
+
+    def _escape_markdown(self, text):
+        """Escape possible markdown tokens in the text generated by the model."""
+        return re.sub(r'((([_*]).+?\3[^_*]*)*)([_*])', r'\g<1>\\\g<4>', text)
+
def _prettify_result(self, model_output):
def pp_answer(text):
"""Pretty-print the answer."""
- # Remove all text after the stop token
- text = text[: text.find(self.stop_token) if self.stop_token else None]
- # Remove multiple answers
- text = self.answer_token.join(text.split(self.answer_token, 2)[:2])
+            # Remove all text after the stop token (if one is set).
+            stop_token = self.config.get('stop_token')
+            if stop_token:
+                stop_ind = text.find(stop_token)
+                if stop_ind > -1:
+                    text = text[:stop_ind]
+ # Remove multiple answers.
+ text = self.config['answer_token'].join(
+ text.split(self.config['answer_token'], 2)[:2])
+ # Escape markdown tokens.
+ text = self._escape_markdown(text)
# Replace model tokens with html formatted ones.
- text = re.sub(f'\{self.default_promt_token} *', 'Question: ', text)
- text = re.sub(f'\{self.answer_token} *', '\nAnswer: ', text)
+ text = re.sub(f'\{self.config["promt_token"]} *', '*Question*: ', text)
+ text = re.sub(f'\{self.config["answer_token"]} *', '\n*Answer*: ', text)
return text
if isinstance(model_output, str):
@@ -60,164 +77,201 @@ def pp_answer(text):
else:
return [pp_answer(ans) for ans in model_output]
- @synchronized
def generate_joke(self, model, promt=""):
"""Generate the joke from given promt.
-
+
:param model: model to use to generate joke.
:param promt: (optional) promt for a joke, if not given
generates the whole joke
:return: `Joke` object
"""
if promt:
- print(f'[INFO] continue - Model: {model.name}')
res = self._continue_joke(model, promt)
else:
- res = self._get_joke_from_buffer()
+ res = self._get_new_joke()
res['text'] = self._prettify_result(res['text'])
joke_id = self.store.add_joke(**res)
return Joke(id=joke_id, text=res['text'])
-
+
@synchronized
- def __call_model(self, model, prompt, num_return_sequences):
+ def __call_model(self, model, prompt, n_return_sequences):
"""Call the model to generate the joke.
:param model: model to use
:param promt: prompt for the model
- :param num_return_sequences: number of sequences to generate
- :return: list of (num_return_sequences) dicts
+ :param n_return_sequences: number of sequences to generate
+ :return: list of (n_return_sequences) dicts
with 'text' and 'generated_by' fields
"""
return [{
'generated_by': model.name,
'text': seq
- } for seq in model.generate(prompt, num_return_sequences)]
-
+ } for seq in model.generate(prompt, n_return_sequences)]
+
+ @synchronized
+ def _get_from_buffer(self, model, buffer):
+ """Get a joke from the buffer.
+
+ :param buffer: buffer to get the joke from
+ :param model: model to use to fill the buffer
+ :return: joke from the buffer
+ """
+ if len(buffer) == 0:
+ buffer = self._fill_buffer(model)
+ model.logger.info('Got joke from the buffer')
+ return buffer.pop()
+
+ @synchronized
+ @abstractmethod
+ def _fill_buffer(self, model):
+ """Fill and return the buffer associated with given model.
+ """
+ pass
+
@synchronized
- def _fill_jokes_buffer(self, model):
- """Fill the jokes buffer associated with given model.
- WARNING: Default implementation just returns the result.
+ def _generate_for_buffer(self, model):
+ """Generates jokes to fill the buffer.
:param model: model to use
:return: see `__call_model` function
"""
- return self.__call_model(model, self.default_promt_token,
- num_return_sequences=self.jokes_buffer_size)
-
- @synchronized
+ model.logger.info('Filling the buffer')
+ return self.__call_model(model, self.config['promt_token'],
+ n_return_sequences=self.config['buffer_size'])
+
@abstractmethod
- def _get_joke_from_buffer(self, model):
- """Get the new joke from the buffer for given model.
- If needed update the buffer.
+ def _get_new_joke(self, model):
+ """Get the new joke from the given model.
"""
pass
-
+
@synchronized
def _continue_joke(self, model, promt):
"""Continue the joke given in promt."""
- model_promt = self.custom_promt.format(' ' + promt.strip())
- return self.__call_model(model, model_promt, num_return_sequences=1)[0]
+ model.logger.info('Continue joke')
+ model_promt = self.config['custom_promt'].format(' ' + promt.strip())
+ return self.__call_model(model, model_promt, n_return_sequences=1)[0]
+
class JokeGenerator(AbstractJokeGenerator):
"""Simple Joke generator using one model."""
- def __init__(self, model_path, max_joke_len=40, jokes_buffer_size=16, model_device='cpu'):
- super().__init__(jokes_buffer_size)
+ def __init__(self, model_path, config):
+ super().__init__(config)
model_name = os.path.split(model_path)[1]
- self.model = ModelWrapper(model_path, model_name, max_length=max_joke_len)
- self._fill_jokes_buffer()
+ self.logger = logging.getLogger(type(self).__name__)
+ self.logger.info('Loading model...')
+ self.model = ModelWrapper(model_path, model_name, **config)
+ self.logger.info(f'Loaded {self.model.name}')
+ self._fill_buffer(self.model)
+ self.logger.info('Ready to work!')
- @synchronized
- def _fill_jokes_buffer(self):
- self.jokes_buffer = super()._fill_jokes_buffer(self.model)
+ def generate_joke(self, promt=""):
+ return super().generate_joke(self.model, promt)
@synchronized
- def _get_joke_from_buffer(self):
- if len(self.jokes_buffer) == 0:
- self._fill_jokes_buffer()
- return self.jokes_buffer.pop()
+ def _fill_buffer(self, model):
+ self.jokes_buffer = super()._generate_for_buffer(model)
+ return self.jokes_buffer
- @synchronized
- def generate_joke(self, promt=""):
- return super().generate_joke(self.model, promt)
+ def _get_new_joke(self):
+ return self._get_from_buffer(self.model, self.jokes_buffer)
class TestABGenerator(AbstractJokeGenerator):
"""Joke generator for a/b testing.
Outputs the joke from either of models/datasets.
- Chooses the source randomly."""
- def __init__(self, dataset_paths, model_paths, max_joke_len=40, jokes_buffer_size=16, model_device='cpu'):
+ Chooses the source randomly.
+ """
+
+ def __init__(self, dataset_paths, model_paths, config):
"""
Loads datasets and models. Initiates pools and orders of passing
:param dataset_paths: paths to the dataset
:param model_paths: paths to the model
"""
- super().__init__(jokes_buffer_size)
-
- self.models = list()
- self.key2pool = dict()
- for model_path in model_paths:
- model_name = os.path.split(model_path)[1]
- self.models.append(ModelWrapper(model_path, model_name, max_length=max_joke_len))
- self._fill_jokes_buffer(self.models[-1])
+ super().__init__(config)
+ self.models, self.key2pool = list(), dict()
+ self.logger = logging.getLogger(type(self).__name__)
+ self.logger.info('Loading models...')
+ for m_path in model_paths:
+ m_name = os.path.split(m_path)[1]
+ self.models.append(ModelWrapper(m_path, m_name, **config))
+ self.logger.info(f'Loaded {self.models[-1].name}')
+ self._fill_buffer(self.models[-1])
+
+ self.logger.info('Loading datasets...')
+ self.datasets = list()
+ for d_path in dataset_paths:
+ self.datasets.append(Dataset(d_path, self.config['promt_token'],
+ self.config['answer_token']))
+ self.logger.info(f'Loaded {self.datasets[-1].name}')
+ self.logger.info('Ready to work!')
+ self.n_pools = len(self.models) + len(self.datasets)
- self.datasets = [Dataset(path) for path in dataset_paths]
- self.num_of_pools = len(self.models) + len(self.datasets)
-
- @synchronized
def generate_joke(self, promt=""):
idx = random.randint(0, len(self.models) - 1)
model = self.models[idx]
return super().generate_joke(model, promt)
@synchronized
- def _fill_jokes_buffer(self, model):
- self.key2pool[model.name] = super()._fill_jokes_buffer(model)
+ def _fill_buffer(self, model):
+ self.key2pool[model.name] = super()._generate_for_buffer(model)
+ return self.key2pool[model.name]
- @synchronized
- def _get_joke_from_buffer(self):
- idx = random.randint(0, self.num_of_pools - 1)
- res = {}
+ def _get_new_joke(self):
+ """Get a joke from either a model buffer or a dataset."""
+ idx = random.randint(0, self.n_pools - 1)
if idx < len(self.datasets):
- print(f'[INFO] generate - Dataset: {self.datasets[idx].name}')
return random.choice(self.datasets[idx])
idx = idx - len(self.datasets)  # remaining indices map onto the models
key = self.models[idx].name
- print(f'[INFO] generate - Model: {key}')
- if len(self.key2pool[key]) == 0:
- self._fill_jokes_buffer(self.models[idx])
- return self.key2pool[key].pop()
+ return self._get_from_buffer(self.models[idx],
+ self.key2pool[key])
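The uniform routing over dataset and model pools in `_get_new_joke` comes down to one index draw; a standalone sketch of the dispatch (illustrative names, model indices counted from the front):

```python
import random

def pick_source(datasets, models):
    # Draw one index over all pools; low indices map to datasets,
    # the rest map onto the models (sketch of the A/B routing).
    idx = random.randint(0, len(datasets) + len(models) - 1)
    if idx < len(datasets):
        return ('dataset', idx)
    return ('model', idx - len(datasets))

random.seed(0)
print(pick_source(['qa_jokes'], ['model_a', 'model_b']))
```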
-class Dataset:
- """Wrapper for the DataFrame to return values similar
- to `AbstractJokeGenerator` output.
- """
+class RussianModelWrapper(AbstractJokeGenerator):
+
+ def __init__(self, eng_model, rus_model_path, config):
+ super().__init__(config)
+ self.eng_model = eng_model
+ config.update({ # Spaces are because of the tokenizer
+ 'promt_token': '[ ВОПРОС]',
+ 'answer_token': '[ ОТВЕТ]',
+ 'custom_promt': '[ ВОПРОС]{}\n[ ОТВЕТ]',
+ 'stop_token': '<| endoftext|>',
+ 'model_type': 'gpt2-yttm',
+ })
+ self.rus_model = JokeGenerator(rus_model_path, config)
- def __init__(self, dataset_path):
- self.name = os.path.split(dataset_path)[1]
- self.data = pd.read_csv(dataset_path)
+ @staticmethod
+ def cyrillic_ratio(text):
+ return sum(map(len, re.findall('[\u0400-\u04FF]*', text))) / len(text)
- def __getitem__(self, idx):
- question = self.data['Question'].iloc[idx].strip()
- answer = self.data['Answer'].iloc[idx].strip()
- text = (JokeGenerator.default_promt_token
- + question + '\n'
- + JokeGenerator.answer_token + ' '
- + answer)
- return {
- 'text': text,
- 'generated_by': self.name,
- }
+ def generate_joke(self, promt=""):
+ """If the ratio of Cyrillic symbols is high enough,
+ generate using the Russian model."""
+ if promt.strip() == '/шутка':
+ return self.rus_model.generate_joke('')
+ if promt and self.cyrillic_ratio(promt) > 0.4:
+ return self.rus_model.generate_joke(promt)
+ return self.eng_model.generate_joke(promt)
+
+ @synchronized
+ def _fill_buffer(self, model):
+ assert False, "Unreachable: _fill_buffer must not be called on RussianModelWrapper."
- def __len__(self):
- return len(self.data)
+ def _get_new_joke(self):
+ """Get a joke from the English model."""
+ return self.eng_model._get_new_joke()
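The language routing in `RussianModelWrapper` rests entirely on the Cyrillic-character ratio; the heuristic in isolation (same regex and 0.4 threshold as above, prompt strings are made up):

```python
import re

def cyrillic_ratio(text):
    # Fraction of characters in the Cyrillic Unicode block U+0400-U+04FF.
    return sum(map(len, re.findall('[\u0400-\u04FF]*', text))) / len(text)

def route(promt):
    # Mirrors RussianModelWrapper.generate_joke's dispatch rule; the empty
    # prompt is short-circuited so cyrillic_ratio never divides by zero.
    if promt and cyrillic_ratio(promt) > 0.4:
        return 'rus_model'
    return 'eng_model'

print(route('почему курица перешла дорогу?'))        # rus_model
print(route('why did the chicken cross the road?'))  # eng_model
```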
if __name__ == '__main__':
+ logging.basicConfig(format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
+ level=logging.INFO)
- datasets = ["data/qa_jokes.csv"]
- models = ["train/models/output_6"]
+ datasets = ["../data/qa_jokes.csv"]
+ models = ["../train/models/output_8"]
gen = TestABGenerator(datasets, models)
diff --git a/bot/main_bot.py b/bot/main_bot.py
index 979fc30..a03c5ef 100644
--- a/bot/main_bot.py
+++ b/bot/main_bot.py
@@ -1,19 +1,23 @@
import logging
import os
+import sys
from functools import wraps
from configparser import ConfigParser
import telegram
-from joke_generator import JokeGenerator, TestABGenerator
+from joke_generator import JokeGenerator, TestABGenerator, RussianModelWrapper
from telegram import InlineKeyboardButton, InlineKeyboardMarkup, ChatAction
from telegram.ext import Updater, CommandHandler, CallbackQueryHandler, MessageHandler, Filters
"""
Telegram bot that serves generated jokes and collects feedback via inline keyboards.
"""
-
logging.basicConfig(format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
- level=logging.INFO)
+ level=logging.INFO,
+ handlers=[
+ logging.FileHandler("run.log"),
+ logging.StreamHandler(sys.stdout),
+ ])
logger = logging.getLogger(__name__)
cfg = ConfigParser()
@@ -23,22 +27,39 @@
model_paths = model_cfg['model_paths'].split(',')
dataset_paths = model_cfg['dataset_paths'].split(',')
model_args = {
- 'max_joke_len': int(model_cfg['max_joke_len']),
- 'jokes_buffer_size': int(model_cfg['buffer_size']),
- 'model_device': model_cfg['device']
+ 'max_len': int(model_cfg['max_joke_len']),
+ 'buffer_size': int(model_cfg['buffer_size']),
+ 'device': model_cfg['device']
}
if cfg['bot']['ab_test'].lower() == 'true':
joke_generator = TestABGenerator(dataset_paths=dataset_paths,
- model_paths=model_paths,
- **model_args
- )
+ model_paths=model_paths,
+ config=model_args
+ )
else:
- joke_generator = JokeGenerator(model_path=model_paths[0], **model_args)
+ joke_generator = JokeGenerator(model_path=model_paths[0], config=model_args)
+
+if model_cfg.get('rus_model_path'):
+ joke_generator = RussianModelWrapper(eng_model=joke_generator,
+ rus_model_path=model_cfg.get('rus_model_path'),
+ config=model_args
+ )
splitter = "::"
pos = "1"
neg = "2"
+GREETING_MESSAGE = "Welcome to the _Joke Generator Bot_."
+
+HELP_MESSAGE = "Use `/joke` to generate a joke. " + \
+ "Or, if you want a joke on a specific topic, " + \
+ "just send me a question and I'll answer it in a playful form." + \
+ "\n\nTo help me learn, please send feedback on jokes through the 👍/👎 buttons."
+
+DISCLAIMER_MESSAGE = "*DISCLAIMER*: This bot is still very dumb and " + \
+ "produces a lot of dark and racist humor. " + \
+ "Don't judge him, he learned it from people."
+
def send_typing_action(func):
"""Sends typing action while processing func command."""
@@ -71,9 +92,8 @@ def general_joke_handler(update, context, promt_text=""):
InlineKeyboardButton("👎", callback_data=f'{joke_id}{splitter}{neg}')]]
reply_markup = InlineKeyboardMarkup(keyboard)
-
- update.message.reply_text(
- joke.text, reply_markup=reply_markup, parse_mode=telegram.ParseMode.HTML)
+ update.message.reply_text(joke.text, reply_markup=reply_markup,
+ parse_mode=telegram.ParseMode.MARKDOWN)
def button_handler(update, context):
@@ -88,8 +108,15 @@ def button_handler(update, context):
context.bot.answer_callback_query(query.id, "Thank you for your feedback")
-def start(update, context):
- update.message.reply_text("Use /joke to generate a joke")
+def start_handler(update, context):
+ update.message.reply_text(GREETING_MESSAGE + '\n\n'
+ + HELP_MESSAGE + '\n\n' +
+ DISCLAIMER_MESSAGE, parse_mode=telegram.ParseMode.MARKDOWN)
+
+
+def help_handler(update, context):
+ update.message.reply_text(HELP_MESSAGE + '\n\n' + DISCLAIMER_MESSAGE,
+ parse_mode=telegram.ParseMode.MARKDOWN)
def error(update, context):
@@ -105,7 +132,8 @@ def main():
updater.dispatcher.add_handler(CommandHandler('joke', joke_command_handler))
updater.dispatcher.add_handler(CallbackQueryHandler(button_handler))
- updater.dispatcher.add_handler(CommandHandler('start', start))
+ updater.dispatcher.add_handler(CommandHandler('start', start_handler))
+ updater.dispatcher.add_handler(CommandHandler('help', help_handler))
updater.dispatcher.add_handler(CallbackQueryHandler(text_handler))
updater.dispatcher.add_error_handler(error)
updater.dispatcher.add_handler(MessageHandler(Filters.text, text_handler))
diff --git a/bot/storage.py b/bot/storage.py
index 4e18e73..780754e 100644
--- a/bot/storage.py
+++ b/bot/storage.py
@@ -5,6 +5,7 @@
db = SqliteDatabase('jokes.db')
default_generated = "unknown"
+
class BaseModel(Model):
class Meta:
database = db
@@ -38,7 +39,8 @@ def add_joke(text, generated_by=default_generated):
def add_or_update_vote(joke_id, user_id, rating):
# https://stackoverflow.com/questions/33485312/insert-or-update-a-peewee-record-in-python
- Vote.insert(joke_id=joke_id, user_id=user_id, rate=rating).on_conflict('replace').execute()
+ Vote.insert(joke_id=joke_id, user_id=user_id,
+ rate=rating).on_conflict('replace').execute()
db.connect()
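The `on_conflict('replace')` upsert above relies on SQLite's conflict-replacement semantics; a stdlib `sqlite3` sketch of the same behaviour (peewee itself is not needed to see it; table and column names mirror the models above):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE vote ('
             'joke_id INTEGER, user_id INTEGER, rate INTEGER, '
             'UNIQUE(joke_id, user_id) ON CONFLICT REPLACE)')

def add_or_update_vote(joke_id, user_id, rating):
    # A second vote by the same user on the same joke replaces the first row.
    conn.execute('INSERT INTO vote VALUES (?, ?, ?)', (joke_id, user_id, rating))

add_or_update_vote(1, 7, 1)
add_or_update_vote(1, 7, 2)  # replaces the earlier vote
print(conn.execute('SELECT COUNT(*) FROM vote').fetchone()[0])  # 1
print(conn.execute('SELECT rate FROM vote').fetchone()[0])      # 2
```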
diff --git a/bot/yt_encoder.py b/bot/yt_encoder.py
new file mode 100644
index 0000000..c24286b
--- /dev/null
+++ b/bot/yt_encoder.py
@@ -0,0 +1,55 @@
+"""Byte pair encoding utilities"""
+import os
+import youtokentome as yttm
+import hashlib
+from transformers.tokenization_utils import PreTrainedTokenizer
+import shutil
+import regex as re
+from os.path import samefile
+
+NEW_LINE = '<|n|>'
+
+class YTEncoder(PreTrainedTokenizer):
+ def_name = 'encoder.model'
+ def __init__(self, filename, *inputs, **kwargs):
+ super().__init__(*inputs, **kwargs)
+
+ if os.path.isdir(filename): filename = os.path.join(filename, self.def_name)
+
+ self.bpe = yttm.BPE(filename)
+ self.hash = hashlib.sha512(open(filename, 'rb').read()).hexdigest()[:10]
+ self.filename = filename
+
+ def encode(self, text, **kwargs):
+ if text and text[0] != ' ': text = ' ' + text
+ text = re.sub(r'(?=[^ ])([\W])([\w])',r'\g<1> \g<2>',text)
+ text = text.replace('\n', f' {NEW_LINE} ')
+
+ return self.bpe.encode([text], output_type=yttm.OutputType.ID)[0]
+
+
+ def decode(self, tokens, **kwargs): # I hate regexps
+ if not isinstance(tokens, list):
+ tokens = tokens.tolist()
+ result = self.bpe.decode(tokens)[0]
+ result = re.sub(r'( )?(<\|n\|>)( )?', r'\n', result)
+ result = re.sub(r'([\n(]) (\w)', r'\g<1>\g<2>', result)
+ result = re.sub(r'(\W)([«"''\n(]|^) (\w)', r'\g<1>\g<2>\g<3>', result)
+ result = re.sub(r'(\w)- (\w)', r'\g<1>-\g<2>', result)
+ return result
+
+ def tokenize(self, text, **kwargs):
+ return self.encode(text)
+
+ @classmethod
+ def from_pretrained(cls, *inputs, **kwargs):
+ return cls(*inputs, **kwargs)
+
+ def add_special_tokens_single_sentence(self, token_ids):
+ return token_ids
+
+ def save_pretrained(self, save_directory):
+ src = self.filename
+ dst = os.path.join(save_directory, self.def_name)
+ if src != dst:
+ shutil.copyfile(src, dst)
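The pre- and post-processing regexes in `YTEncoder` are independent of the BPE model itself; their round trip over the `<|n|>` newline token can be checked in isolation (same patterns as in `encode`/`decode` above, shown here as free functions with a made-up input):

```python
import re

NEW_LINE = '<|n|>'

def pre_encode(text):
    # Mirrors YTEncoder.encode's text normalisation before BPE.
    if text and text[0] != ' ':
        text = ' ' + text
    text = re.sub(r'(?=[^ ])([\W])([\w])', r'\g<1> \g<2>', text)
    return text.replace('\n', f' {NEW_LINE} ')

def post_decode(result):
    # First two steps of YTEncoder.decode: restore real newlines and
    # drop the space inserted after them by pre_encode.
    result = re.sub(r'( )?(<\|n\|>)( )?', '\n', result)
    result = re.sub(r'([\n(]) (\w)', r'\g<1>\g<2>', result)
    return result

print(repr(post_decode(pre_encode('Q: why?\nA: because'))))  # ' Q: why?\nA: because'
```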
diff --git a/data/short_jokes.csv b/data/short_jokes.csv
index 19ecf03..b18d1c2 100644
--- a/data/short_jokes.csv
+++ b/data/short_jokes.csv
@@ -152006,7 +152006,7 @@
152476,"What do you call 4 mexicans in quick sand? Cuatro sinko"
152477,"What did the pirate say when he saw the dank meme? arrr lmao"
152478,"What did Pitbull ask for Christmas? Dolly."
-152479,"How many Karma whores does it take to change a lightbulb? I don't care.
+152479,"How many Karma whores does it take to change a lightbulb? I don't care."
152480,"A man walks into a bar and says... ""Argh, fuck!"""
152481,"I know a great knock knock joke But you have to start it."
152482,"What do you call a group of transsexual surfers? The radical left."
diff --git a/images/answers.png b/images/answers.png
new file mode 100644
index 0000000..1d95067
Binary files /dev/null and b/images/answers.png differ
diff --git a/images/greeting.png b/images/greeting.png
new file mode 100644
index 0000000..f0132ea
Binary files /dev/null and b/images/greeting.png differ
diff --git a/report.pdf b/report.pdf
new file mode 100644
index 0000000..9d45c3b
Binary files /dev/null and b/report.pdf differ
diff --git a/requirements.txt b/requirements.txt
index f4060ed..60c1e16 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -2,4 +2,5 @@ numpy
pandas
python-telegram-bot
peewee
-transformers
\ No newline at end of file
+youtokentome
+transformers
diff --git a/train/GPT-2 train helper.ipynb b/train/GPT-2 train helper.ipynb
index bebe7b5..e0d3890 100644
--- a/train/GPT-2 train helper.ipynb
+++ b/train/GPT-2 train helper.ipynb
@@ -2,26 +2,71 @@
"cells": [
{
"cell_type": "code",
- "execution_count": 1,
+ "execution_count": 5,
"metadata": {},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "C:\\Users\\Alex\\Anaconda3\\envs\\pytorch\\lib\\site-packages\\tensorflow\\python\\framework\\dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n",
+ " _np_qint8 = np.dtype([(\"qint8\", np.int8, 1)])\n",
+ "C:\\Users\\Alex\\Anaconda3\\envs\\pytorch\\lib\\site-packages\\tensorflow\\python\\framework\\dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n",
+ " _np_quint8 = np.dtype([(\"quint8\", np.uint8, 1)])\n",
+ "C:\\Users\\Alex\\Anaconda3\\envs\\pytorch\\lib\\site-packages\\tensorflow\\python\\framework\\dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n",
+ " _np_qint16 = np.dtype([(\"qint16\", np.int16, 1)])\n",
+ "C:\\Users\\Alex\\Anaconda3\\envs\\pytorch\\lib\\site-packages\\tensorflow\\python\\framework\\dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n",
+ " _np_quint16 = np.dtype([(\"quint16\", np.uint16, 1)])\n",
+ "C:\\Users\\Alex\\Anaconda3\\envs\\pytorch\\lib\\site-packages\\tensorflow\\python\\framework\\dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n",
+ " _np_qint32 = np.dtype([(\"qint32\", np.int32, 1)])\n",
+ "C:\\Users\\Alex\\Anaconda3\\envs\\pytorch\\lib\\site-packages\\tensorflow\\python\\framework\\dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n",
+ " np_resource = np.dtype([(\"resource\", np.ubyte, 1)])\n",
+ "C:\\Users\\Alex\\Anaconda3\\envs\\pytorch\\lib\\site-packages\\tensorboard\\compat\\tensorflow_stub\\dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n",
+ " _np_qint8 = np.dtype([(\"qint8\", np.int8, 1)])\n",
+ "C:\\Users\\Alex\\Anaconda3\\envs\\pytorch\\lib\\site-packages\\tensorboard\\compat\\tensorflow_stub\\dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n",
+ " _np_quint8 = np.dtype([(\"quint8\", np.uint8, 1)])\n",
+ "C:\\Users\\Alex\\Anaconda3\\envs\\pytorch\\lib\\site-packages\\tensorboard\\compat\\tensorflow_stub\\dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n",
+ " _np_qint16 = np.dtype([(\"qint16\", np.int16, 1)])\n",
+ "C:\\Users\\Alex\\Anaconda3\\envs\\pytorch\\lib\\site-packages\\tensorboard\\compat\\tensorflow_stub\\dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n",
+ " _np_quint16 = np.dtype([(\"quint16\", np.uint16, 1)])\n",
+ "C:\\Users\\Alex\\Anaconda3\\envs\\pytorch\\lib\\site-packages\\tensorboard\\compat\\tensorflow_stub\\dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n",
+ " _np_qint32 = np.dtype([(\"qint32\", np.int32, 1)])\n",
+ "C:\\Users\\Alex\\Anaconda3\\envs\\pytorch\\lib\\site-packages\\tensorboard\\compat\\tensorflow_stub\\dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n",
+ " np_resource = np.dtype([(\"resource\", np.ubyte, 1)])\n"
+ ]
+ }
+ ],
"source": [
"import os\n",
"import json\n",
"import random\n",
"from glob import glob\n",
"import re\n",
+ "from tqdm.notebook import tqdm\n",
"import numpy as np\n",
"import pandas as pd\n",
"import transformers\n",
"\n",
+ "# Eng\n",
+ "\n",
"qa_jokes_filepath = os.path.join('..', 'data', 'qa_jokes.csv')\n",
"short_jokes_filepath = os.path.join('..', 'data', 'short_jokes.csv')\n",
"transcripts_path = os.path.join('..', 'data', 'transcripts')\n",
"\n",
"qa_jokes_prep_outpath = os.path.join('..', 'data', 'prep', 'qa_jokes_gpt2.txt')\n",
"short_jokes_prep_outpath = os.path.join('..', 'data', 'prep', 'short_jokes_gpt2.txt')\n",
- "transcripts_prep_outpath = os.path.join('..', 'data', 'prep', 'transcripts_gpt2.txt')"
+ "transcripts_prep_outpath = os.path.join('..', 'data', 'prep', 'transcripts_gpt2.txt')\n",
+ "\n",
+ "# Rus\n",
+ "\n",
+ "rus_qa_jokes_filepath = os.path.join('..', 'data', 'rus_qa_jokes.csv')\n",
+ "rus_jokes_filepath = os.path.join('..', 'data', 'rus_jokes.csv')\n",
+ "rus_stories_filepath = os.path.join('..', 'data', 'anekdot_stories.csv')\n",
+ "\n",
+ "\n",
+ "rus_qa_jokes_prep_outpath = os.path.join('..', 'data', 'prep', 'rus_qa_jokes_gpt2.txt')\n",
+ "rus_jokes_prep_outpath = os.path.join('..', 'data', 'prep', 'rus_jokes_gpt2.txt')\n",
+ "rus_stories_prep_outpath = os.path.join('..', 'data', 'prep', 'rus_stories_gpt2.txt')"
]
},
{
@@ -33,17 +78,29 @@
},
{
"cell_type": "code",
- "execution_count": 2,
+ "execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"def fix_encoding(s):\n",
" \"\"\"Skip characters that can't be encoded by standard encoder.\"\"\"\n",
- " return s.encode('latin1', 'ignore').decode('utf8', 'ignore')\n",
+ " return s.encode('utf-8', 'ignore').decode('utf8', 'ignore')\n",
"\n",
- "def write_to_file(file_path, text):\n",
+ "# TODO: Add > <\n",
+ "regexps = [ # Regexp for the special chars\n",
+ " (re.compile('♦'), '*'),\n",
+ " (re.compile('\\n *\\n'), '\\n'), # Replace multiple newlines with one\n",
+ " (re.compile(r' {2,}'), ' '), # Replace multiple spaces with one\n",
+ "]\n",
+ "\n",
+ "def fix_text(s):\n",
+ " for regexp in regexps:\n",
+ " s = regexp[0].sub(regexp[1], s)\n",
+ " return fix_encoding(s.strip())\n",
+ "\n",
+ "def write_to_file(file_path, text, encoding=None):\n",
" os.makedirs(os.path.dirname(file_path), exist_ok=True)\n",
- " with open(file_path, 'w') as out_file:\n",
+ " with open(file_path, 'w', encoding=encoding) as out_file:\n",
" out_file.write(text)\n",
"\n",
"START_DOC_TOKEN = ''\n",
@@ -62,7 +119,7 @@
},
{
"cell_type": "code",
- "execution_count": 51,
+ "execution_count": 3,
"metadata": {},
"outputs": [
{
@@ -70,13 +127,13 @@
"output_type": "stream",
"text": [
"\n",
- "RangeIndex: 38233 entries, 0 to 38232\n",
+ "RangeIndex: 38232 entries, 0 to 38231\n",
"Data columns (total 3 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
- " 0 ID 38233 non-null int64 \n",
- " 1 Question 38233 non-null object\n",
- " 2 Answer 38233 non-null object\n",
+ " 0 ID 38232 non-null int64 \n",
+ " 1 Question 38232 non-null object\n",
+ " 2 Answer 38232 non-null object\n",
"dtypes: int64(1), object(2)\n",
"memory usage: 896.2+ KB\n"
]
@@ -89,15 +146,17 @@
},
{
"cell_type": "code",
- "execution_count": 52,
+ "execution_count": 63,
"metadata": {},
"outputs": [],
"source": [
- "qa_corpus = ''\n",
+ "qa_corpus = []\n",
"for _, question, answer in qa_jokes.values:\n",
- " qa_corpus += fix_encoding(f'{START_DOC_TOKEN}[QUESTION] {question}\\n[ANSWER] {answer}\\n{END_DOC_TOKEN}\\n')\n",
+ " qa_corpus.append(f'{START_DOC_TOKEN}[QUESTION] {question}\\n[ANSWER] {answer}\\n{END_DOC_TOKEN}')\n",
"\n",
- "write_to_file(qa_jokes_prep_outpath, qa_corpus)"
+ "qa_corpus = '\\n'.join(map(lambda s: fix_text(s), qa_corpus))\n",
+ "\n",
+ "write_to_file(qa_jokes_prep_outpath, qa_corpus, encoding='utf-8')"
]
},
{
@@ -110,40 +169,22 @@
},
{
"cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Additional cleaning\n",
- "from glob import glob\n",
- "import re\n",
- "for f_path in glob(os.path.join(transcripts_path,'*')):\n",
- " with open(f_path, encoding='utf8') as f:\n",
- " text = ' '.join(f.readlines())\n",
- " text = re.sub('\\n *\\n', '\\n', text)\n",
- " text = re.sub(r' {2,}', ' ', text)\n",
- " if len(re.findall(r'html|http|jpe?g|png|mp4', text)) > 0:\n",
- " print('Has http|html in it:', f_path)\n",
- " with open(f_path, 'w', encoding='utf8') as f:\n",
- " f.write(text)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 115,
+ "execution_count": 65,
"metadata": {},
"outputs": [],
"source": [
- "transcript_corpus = ''\n",
+ "transcript_corpus = []\n",
"# Load transcripts.\n",
"for file_path in glob(os.path.join(transcripts_path, '*')):\n",
" with open(file_path, 'r', encoding='utf8') as in_file:\n",
- " transcript_corpus += START_DOC_TOKEN + ''.join(in_file.read()) + END_DOC_TOKEN + '\\n'\n",
+ "        text = in_file.read()\n",
+ "        if len(re.findall(r'html|http|jpe?g|png|mp4', text)) > 0:\n",
+ "            print('Has http|html in it:', file_path)\n",
+ "        transcript_corpus.append(START_DOC_TOKEN + text + END_DOC_TOKEN)\n",
"\n",
- "transcript_corpus = fix_encoding(fix_encoding(transcript_corpus))\n",
+ "transcript_corpus = '\\n'.join(map(lambda s: fix_text(s), transcript_corpus))\n",
"\n",
"# Save all them as dataset.\n",
- "write_to_file(transcripts_prep_outpath, transcript_corpus)"
+ "write_to_file(transcripts_prep_outpath, transcript_corpus, encoding='utf-8')"
]
},
{
@@ -158,7 +199,7 @@
},
{
"cell_type": "code",
- "execution_count": 54,
+ "execution_count": 4,
"metadata": {},
"outputs": [
{
@@ -166,12 +207,12 @@
"output_type": "stream",
"text": [
"\n",
- "RangeIndex: 230975 entries, 0 to 230974\n",
+ "RangeIndex: 230974 entries, 0 to 230973\n",
"Data columns (total 2 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
- " 0 ID 230975 non-null int64 \n",
- " 1 Joke 230975 non-null object\n",
+ " 0 ID 230974 non-null int64 \n",
+ " 1 Joke 230974 non-null object\n",
"dtypes: int64(1), object(1)\n",
"memory usage: 3.5+ MB\n"
]
@@ -184,7 +225,7 @@
},
{
"cell_type": "code",
- "execution_count": 56,
+ "execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
@@ -199,16 +240,16 @@
},
{
"cell_type": "code",
- "execution_count": 61,
+ "execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
- "\"What's the top song by the Vietnamese Beatles? Rice Fields Forever.\""
+ "'What do you call a Muslim organization that rejects Muhammed? A non-prophet'"
]
},
- "execution_count": 61,
+ "execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
@@ -221,7 +262,20 @@
},
{
"cell_type": "code",
- "execution_count": 62,
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "jokes = {'Question': [], 'Answer': []}\n",
+ "for i, joke in short_jokes.iloc[qa_jokes_in_short_jokes[ind]].values:\n",
+ " sentences = sent_tokenize(joke.strip())\n",
+ " question, answer = sentences[0], ' '.join(sentences[1:])\n",
+ "    jokes['Question'].append(question)\n",
+ "    jokes['Answer'].append(answer)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 71,
"metadata": {},
"outputs": [],
"source": [
@@ -236,7 +290,7 @@
},
{
"cell_type": "code",
- "execution_count": 63,
+ "execution_count": 72,
"metadata": {},
"outputs": [],
"source": [
@@ -246,7 +300,7 @@
},
{
"cell_type": "code",
- "execution_count": 64,
+ "execution_count": 73,
"metadata": {},
"outputs": [],
"source": [
@@ -258,6 +312,249 @@
"write_to_file(short_jokes_prep_outpath, short_jokes_corpus)"
]
},
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Russian Stories"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 149,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n",
+ "Int64Index: 110529 entries, 0 to 110528\n",
+ "Data columns (total 1 columns):\n",
+ " # Column Non-Null Count Dtype \n",
+ "--- ------ -------------- ----- \n",
+ " 0 Text 110529 non-null object\n",
+ "dtypes: object(1)\n",
+ "memory usage: 1.7+ MB\n"
+ ]
+ }
+ ],
+ "source": [
+ "rus_stories = pd.read_csv(rus_stories_filepath, index_col=0)\n",
+ "rus_stories.info()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 155,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "rus_stories_corpus = '\\n'.join(map(lambda s: START_DOC_TOKEN + fix_text(s[0]) + END_DOC_TOKEN, rus_stories.values))\n",
+ "\n",
+ "# Save all them as dataset.\n",
+ "write_to_file(rus_stories_prep_outpath, rus_stories_corpus, encoding='utf-8')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Russian Jokes"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 151,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n",
+ "Int64Index: 439057 entries, 0 to 439056\n",
+ "Data columns (total 1 columns):\n",
+ " # Column Non-Null Count Dtype \n",
+ "--- ------ -------------- ----- \n",
+ " 0 Text 439057 non-null object\n",
+ "dtypes: object(1)\n",
+ "memory usage: 6.7+ MB\n"
+ ]
+ }
+ ],
+ "source": [
+ "rus_jokes = pd.read_csv(rus_jokes_filepath, index_col=0)\n",
+ "rus_jokes.info()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 152,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " Text\n",
+ "0 Один хороший анекдот — это дополнительные 15 м...\n",
+ "1 Старик и старуха в суде. Судья: — Почему разво...\n",
+ "2 Х♦♦ — железо, пока горячо.\n",
+ "3 — Сколько нужно вагонов, чтобы вывезти всех де...\n",
+ "4 Женщина: в 20 лет — лепестки розы в 30 лет — с...\n",
+ "... ...\n",
+ "439052 Вот было бы так: поел, завалился на диван, сп...\n",
+ "439053 Мишустин доездился без пропуска - вот и заболе...\n",
+ "439054 Напоследок Мишустин из больницы пообещал сдела...\n",
+ "439055 План следующего выступления: 1.Ситуация с коро...\n",
+ "439056 - Видел обращение Путина? - В кого?!\n",
+ "\n",
+ "[439057 rows x 1 columns]"
+ ]
+ },
+ "execution_count": 152,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "rus_jokes"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 156,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "rus_jokes_corpus = '\\n'.join(map(lambda s: START_DOC_TOKEN + fix_text(s[0]) + END_DOC_TOKEN, rus_jokes.values))\n",
+ "\n",
+ "\n",
+ "# # Save all them as dataset.\n",
+ "write_to_file(rus_jokes_prep_outpath, rus_jokes_corpus, encoding='utf-8')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Russian QA Jokes"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 135,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n",
+ "RangeIndex: 67563 entries, 0 to 67562\n",
+ "Data columns (total 3 columns):\n",
+ " # Column Non-Null Count Dtype \n",
+ "--- ------ -------------- ----- \n",
+ " 0 Unnamed: 0 67563 non-null int64 \n",
+ " 1 Question 67563 non-null object\n",
+ " 2 Answer 67563 non-null object\n",
+ "dtypes: int64(1), object(2)\n",
+ "memory usage: 1.5+ MB\n"
+ ]
+ }
+ ],
+ "source": [
+ "rus_qa_jokes = pd.read_csv(rus_qa_jokes_filepath)\n",
+ "rus_qa_jokes.info()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 136,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "rus_qa_corpus = []\n",
+ "for _, question, answer in rus_qa_jokes.values:\n",
+ " rus_qa_corpus.append(fix_text(f'{START_DOC_TOKEN}[ ВОПРОС] {question}\\n[ ОТВЕТ] {answer}\\n{END_DOC_TOKEN}'))\n",
+ "\n",
+ "rus_qa_corpus = '\\n'.join(rus_qa_corpus)\n",
+ "\n",
+ "write_to_file(rus_qa_jokes_prep_outpath, rus_qa_corpus, encoding='utf-8')"
+ ]
+ },
{
"cell_type": "markdown",
"metadata": {},
@@ -267,26 +564,33 @@
},
{
"cell_type": "code",
- "execution_count": 51,
+ "execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"def create_cmd_command(python_path, script, kwargs, flags):\n",
" args = ' '.join(f'--{k}={v}' for k, v in kwargs.items())\n",
" args += ' ' + ' '.join(f'--{f}' for f in flags)\n",
- " return f'{python_path} {script} {args}'\n",
- "\n",
+ " return f'{python_path} {script} {args}'"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "metadata": {},
+ "outputs": [],
+ "source": [
"python_path = r'C:\\Users\\Alex\\Anaconda3\\envs\\pytorch\\python.exe'\n",
- "script = r'run_language_modeling.py'\n",
+ "script = r'run_lm_finetuning.py'\n",
"train_kwargs = {\n",
" 'model_type': 'gpt2', # gpt2, ctrl, openai-gpt, xlnet, transfo-xl, xlm\n",
" 'model_name_or_path':'gpt2',\n",
" 'output_dir':'output',\n",
" 'block_size': 512,\n",
" 'learning_rate': 1e-6,\n",
- " 'num_train_epochs': 10,\n",
+ " 'num_train_epochs': 3,\n",
" 'per_gpu_train_batch_size': 2,\n",
- " 'gradient_accumulation_steps': 4,\n",
+ " 'gradient_accumulation_steps': 8,\n",
" 'save_steps': 1000,\n",
"# 'max_steps': 20000,\n",
"}\n",
@@ -302,21 +606,21 @@
" 'output_4', # short_jokes 1e-7, 2 grad_acc 2\n",
" 'output_5', # qa_jokes 1e-5, 3 grad_acc 8 - most funny yet\n",
" 'output_6', # qa_jokes 1e-5, 3 grad_acc 4\n",
- " 'output_7', # qa_jokes 1e-6, 2 grad_acc 2 - ?\n",
- " 'output_8', # qa_jokes 1e-6, 10 grad_acc 8 ###\n",
+ " 'output_7', # qa_jokes 1e-6, 2 grad_acc 2\n",
+ " 'output_8', # qa_jokes 1e-6, 10 grad_acc 8\n",
" \n",
"]\n",
"\n",
"train_flags = [\n",
" 'do_train',\n",
- "# 'overwrite_output_dir',\n",
+ " 'overwrite_output_dir',\n",
"# 'fp16',\n",
"]"
]
},
{
"cell_type": "code",
- "execution_count": 52,
+ "execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
@@ -327,28 +631,28 @@
},
{
"cell_type": "code",
- "execution_count": 53,
+ "execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
- "From: models\\output_7 \n",
- "To: models\\output_8\n"
+ "From: gpt2 \n",
+ "To: models\\output\n"
]
}
],
"source": [
- "# train_kwargs['model_name_or_path'] = train_outputs[0]\n",
- "train_kwargs['model_name_or_path'] = os.path.join('models', train_outputs[8])\n",
- "train_kwargs['output_dir'] = os.path.join('models', train_outputs[9])\n",
+ "train_kwargs['model_name_or_path'] = train_outputs[0]\n",
+ "# train_kwargs['model_name_or_path'] = os.path.join('models', train_outputs[0])\n",
+ "train_kwargs['output_dir'] = os.path.join('models', train_outputs[1])\n",
"print('From:', train_kwargs['model_name_or_path'], '\\nTo:', train_kwargs['output_dir'])"
]
},
{
"cell_type": "code",
- "execution_count": 54,
+ "execution_count": 20,
"metadata": {
"scrolled": true
},
@@ -357,7 +661,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
- "C:\\Users\\Alex\\Anaconda3\\envs\\pytorch\\python.exe run_language_modeling.py --model_type=gpt2 --model_name_or_path=models\\output_7 --output_dir=models\\output_8 --block_size=512 --learning_rate=1e-06 --num_train_epochs=10 --per_gpu_train_batch_size=2 --gradient_accumulation_steps=4 --save_steps=1000 --train_data_file=..\\data\\prep\\qa_jokes_gpt2.txt --do_train\n"
+ "C:\\Users\\Alex\\Anaconda3\\envs\\pytorch\\python.exe ru_transformers-master\\run_lm_finetuning.py --model_type=gpt2 --model_name_or_path=gpt2 --output_dir=models\\output --block_size=512 --learning_rate=1e-06 --num_train_epochs=3 --per_gpu_train_batch_size=2 --gradient_accumulation_steps=8 --save_steps=1000 --train_data_file=..\\data\\prep\\qa_jokes_gpt2.txt --do_train --overwrite_output_dir\n"
]
}
],
@@ -375,14 +679,14 @@
},
{
"cell_type": "code",
- "execution_count": 59,
+ "execution_count": 132,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
- "C:\\Users\\Alex\\Anaconda3\\envs\\pytorch\\python.exe run_generation.py --model_type=gpt2 --model_name_or_path=models\\output_8 --prompt=\"[QUESTION]\" --length=100 --stop_token=\"<|endoftext|>\" --temperature=0.9 --repetition_penalty=1.05 --k=50 --p=0.95 --num_return_sequences=40 \n"
+ "C:\\Users\\Alex\\Anaconda3\\envs\\pytorch\\python.exe run_generation.py --model_type=gpt2 --model_name_or_path=models\\rus_2 --prompt=\"[QUESTION]\" --length=100 --stop_token=\"<|endoftext|>\" --temperature=0.9 --repetition_penalty=1.05 --k=50 --p=0.95 --num_return_sequences=40 \n"
]
}
],
@@ -408,6 +712,170 @@
"print(cmd_command)"
]
},
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## ru_transformers\n",
+ "https://github.com/mgrankin/ru_transformers"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 177,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "python_path = r'C:\\Users\\Alex\\Anaconda3\\envs\\pytorch\\python.exe'\n",
+ "script = r'run_lm_finetuning.py'\n",
+ "train_kwargs = {\n",
+ " 'model_type': 'gpt2-yttm', # gpt2, ctrl, openai-gpt, xlnet, transfo-xl, xlm\n",
+ " 'model_name_or_path':'gpt2',\n",
+ " 'output_dir':'output',\n",
+ " 'block_size': 512,\n",
+ " 'learning_rate': 5e-7,\n",
+ " 'num_train_epochs': 5,\n",
+ " 'per_gpu_train_batch_size': 2,\n",
+ " 'gradient_accumulation_steps': 16,\n",
+ " 'save_steps': 1000,\n",
+ " 'save_total_limit': 3,\n",
+ " 'logging_steps': 10,\n",
+ " 'warmup_samples': 500,\n",
+ " 'unfreeze_level': -1,\n",
+ "# 'max_steps': 20000,\n",
+ "}\n",
+ "\n",
+ "train_outputs = [\n",
+ " r'ru_gpt2/s_checkpoint-1900000', # 'ru_gpt2\\m_checkpoint-3364613'\n",
+ " 'rus_test_1', # rus_jokes 1 1e-4 16 unfreeze 0\n",
+ " 'rus_test_2', # rus_jokes 1 5e-5 16 unfreeze 1\n",
+ " 'rus_test_3', # rus_jokes 1 5e-5 16 unfreeze 2\n",
+ " 'rus_test_4', # rus_jokes 1 1e-4 16 unfreeze 7\n",
+ " 'rus_test_5', # rus_jokes 2 5e-6 16 unfreeze -1\n",
+ " 'rus_test_6', # rus_qa_jokes \n",
+ " 'rus_test_7', # rus_qa_jokes\n",
+ "]\n",
+ "\n",
+ "\n",
+ "train_flags = [\n",
+ " 'do_train',\n",
+ " 'overwrite_output_dir',\n",
+ " 'lr_decay',\n",
+ "# 'fp16',\n",
+ "]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 178,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "..\\data\\prep\\rus_qa_jokes_gpt2.txt\n"
+ ]
+ }
+ ],
+ "source": [
+ "# train_kwargs['train_data_file'] = rus_stories_prep_outpath\n",
+ "# train_kwargs['train_data_file'] = rus_jokes_prep_outpath\n",
+ "train_kwargs['train_data_file'] = rus_qa_jokes_prep_outpath\n",
+ "print(train_kwargs['train_data_file'])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 179,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "From: models\\rus_test_6 \n",
+ "To: models\\rus_test_7\n"
+ ]
+ }
+ ],
+ "source": [
+ "# train_kwargs['model_name_or_path'] = train_outputs[0]\n",
+ "train_kwargs['model_name_or_path'] = os.path.join('models', train_outputs[6])\n",
+ "\n",
+ "# train_kwargs['tokenizer_name'] = train_kwargs['tokenizer_name'].format(train_kwargs['model_name_or_path'])\n",
+ "train_kwargs['output_dir'] = os.path.join('models', train_outputs[7])\n",
+ "print('From:', train_kwargs['model_name_or_path'], '\\nTo:', train_kwargs['output_dir'])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 180,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "C:\\Users\\Alex\\Anaconda3\\envs\\pytorch\\python.exe ru_transformers-master\\run_lm_finetuning.py --model_type=gpt2-yttm --model_name_or_path=models\\rus_test_6 --output_dir=models\\rus_test_7 --block_size=512 --learning_rate=5e-07 --num_train_epochs=5 --per_gpu_train_batch_size=2 --gradient_accumulation_steps=16 --save_steps=1000 --save_total_limit=3 --logging_steps=10 --warmup_samples=500 --unfreeze_level=-1 --train_data_file=..\\data\\prep\\rus_qa_jokes_gpt2.txt --do_train --overwrite_output_dir --lr_decay\n"
+ ]
+ }
+ ],
+ "source": [
+ "cmd_command = create_cmd_command(python_path, script, train_kwargs, train_flags)\n",
+ "print(cmd_command)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ " --do_eval \\\n",
+ " --evaluate_during_training \\\n",
+ " --eval_steps 1000 \\\n",
+ " --eval_data_file=./data/classic/valid \\"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "#### Generate"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 24,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "C:\\Users\\Alex\\Anaconda3\\envs\\pytorch\\python.exe ru_transformers-master\\run_generation.py --model_type=gpt2-yttm --model_name_or_path=models\\output --length=200 --temperature=0.9 --stop_token=\"<|endoftext|>\" --top_k=50 --top_p=0.95 --num_return_sequences=20 \n"
+ ]
+ }
+ ],
+ "source": [
+ "gen_script = r'run_generation.py'\n",
+ "generate_kwargs = {\n",
+ " 'model_type': 'gpt2-yttm',\n",
+ " 'model_name_or_path': train_kwargs['output_dir'],\n",
+ " 'length': 200,\n",
+ " 'temperature': 0.9, # temperature of 1.0 has no effect, lower tend toward greedy sampling\n",
+ " 'stop_token': f'\"{END_DOC_TOKEN}\"',\n",
+ " 'top_k': 50,\n",
+ " 'top_p': 0.95,\n",
+ " 'num_return_sequences':20,\n",
+ "}\n",
+ "gen_flags = []\n",
+ "\n",
+ "cmd_command = create_cmd_command(python_path, gen_script, generate_kwargs, gen_flags)\n",
+ "print(cmd_command)"
+ ]
+ },
{
"cell_type": "code",
"execution_count": null,
diff --git a/train/Web scraper.ipynb b/train/Web scraper.ipynb
index 6f61154..8c13d98 100644
--- a/train/Web scraper.ipynb
+++ b/train/Web scraper.ipynb
@@ -2,7 +2,7 @@
"cells": [
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
@@ -13,6 +13,7 @@
"from bs4.element import Tag\n",
"import re\n",
"import os\n",
+ "import pandas as pd\n",
"from tqdm.notebook import tqdm\n",
"import socket\n",
"\n",
@@ -21,19 +22,92 @@
},
{
"cell_type": "code",
- "execution_count": 8,
+ "execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
+ "headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}\n",
+ "\n",
"regexps = [\n",
" (re.compile('♪[^♪]*♪|\\[[^\\]]*\\]|\\([^\\)]*\\)'), ' '),\n",
+ " (re.compile('<\\/br>'), '\\n'),\n",
" (re.compile('<\\/?[\\w ]*>'), ' '), # for <\\br> and similar tags\n",
"]"
]
},
{
"cell_type": "code",
- "execution_count": 9,
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def save_file(path, txt, encoding=None):\n",
+ " # Create the corresponding folder (if needed)\n",
+ " os.makedirs(os.path.dirname(path), exist_ok=True)\n",
+ " with open(path, 'w', encoding=encoding) as out_file:\n",
+ " out_file.writelines(txt)\n",
+ "\n",
+ "\n",
+ "def process_block(block):\n",
+ " result = []\n",
+ " # If not text, skip\n",
+ " try:\n",
+ " if block.name in ['img'] or (block.name == 'div' and block.get('class', [''])[0] == 'yarpp-related'):\n",
+ " return result\n",
+ " except AttributeError:\n",
+ " print('[ERROR] AttributeError!')\n",
+ " print(type(block))\n",
+ " print(block)\n",
+ " print('---------------------------------')\n",
+ " # If is a tag, process it's content\n",
+ " if isinstance(block, Tag):\n",
+ " for sub_block in block.contents:\n",
+ " result.extend(process_block(sub_block))\n",
+ " return result\n",
+ " for regexp, sub_str in regexps:\n",
+ " block = regexp.sub(sub_str, block)\n",
+ " block = block.strip()\n",
+ " if block:\n",
+ " result.append(block)\n",
+ " return result\n",
+ "\n",
+ "pages_to_skip = []\n",
+ "\n",
+ "def scrap_page(func, urls_gen, n_batches, save_to_files=False):\n",
+ " n_processed = 0\n",
+ " pbar = tqdm(total=n_batches)\n",
+ " accumulator = []\n",
+ " for i, url in enumerate(urls_gen(n_batches, pbar)):\n",
+ " # Skip \"bad\" pages\n",
+ " if url in pages_to_skip:\n",
+ " continue\n",
+ " try:\n",
+ " pbar.set_description('Loading {} page...'.format(i+1))\n",
+ " page = requests.get(url)\n",
+ " pbar.set_description('Processing {} page...'.format(i+1))\n",
+ " except requests.exceptions.ConnectionError:\n",
+ " print('[ERROR] Connection error to', url)\n",
+ " pbar.update(1)\n",
+ " continue\n",
+ " soup = BeautifulSoup(page.content, 'html.parser')\n",
+ " new_items = func(soup, url)\n",
+ " accumulator.extend(new_items)\n",
+ " n_processed += len(new_items)\n",
+ " print('[INFO] Processed another ', len(new_items), 'items, for a total of', n_processed)\n",
+ " pbar.update(1)\n",
+ " return accumulator"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## scrapsfromtheloft.com"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 174,
"metadata": {},
"outputs": [],
"source": [
@@ -79,13 +153,13 @@
},
{
"cell_type": "code",
- "execution_count": 10,
+ "execution_count": 179,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
- "model_id": "673ecaa8ed484a79b1d0962f361a5b95",
+ "model_id": "defc52907dd54ba3866d3ff288e2f44b",
"version_major": 2,
"version_minor": 0
},
@@ -100,136 +174,4269 @@
"name": "stdout",
"output_type": "stream",
"text": [
- "[INFO] Processed another 10 blocks, for a total of 10\n",
- "[INFO] Processed another 10 blocks, for a total of 20\n",
- "[INFO] Processed another 10 blocks, for a total of 30\n",
- "[INFO] Processed another 10 blocks, for a total of 40\n",
- "[INFO] Processed another 10 blocks, for a total of 50\n",
- "[INFO] Processed another 10 blocks, for a total of 60\n",
- "[INFO] Processed another 10 blocks, for a total of 70\n",
- "[INFO] Processed another 10 blocks, for a total of 80\n",
- "[INFO] Processed another 10 blocks, for a total of 90\n",
- "[INFO] Processed another 10 blocks, for a total of 100\n",
- "[INFO] Processed another 10 blocks, for a total of 110\n",
- "[INFO] Processed another 10 blocks, for a total of 120\n",
- "[INFO] Processed another 10 blocks, for a total of 130\n",
- "[INFO] Processed another 10 blocks, for a total of 140\n",
- "[INFO] Processed another 10 blocks, for a total of 150\n",
- "[INFO] Processed another 10 blocks, for a total of 160\n",
- "[INFO] Processed another 10 blocks, for a total of 170\n",
- "[INFO] Processed another 10 blocks, for a total of 180\n",
- "[INFO] Processed another 10 blocks, for a total of 190\n",
- "[INFO] Processed another 10 blocks, for a total of 200\n",
- "[INFO] Processed another 10 blocks, for a total of 210\n",
- "[INFO] Processed another 10 blocks, for a total of 220\n",
- "[INFO] Processed another 10 blocks, for a total of 230\n",
- "[INFO] Processed another 10 blocks, for a total of 240\n",
- "[INFO] Processed another 10 blocks, for a total of 250\n",
- "[INFO] Processed another 10 blocks, for a total of 260\n",
- "[INFO] Processed another 10 blocks, for a total of 270\n",
- "[INFO] Processed another 10 blocks, for a total of 280\n",
- "[INFO] Processed another 10 blocks, for a total of 290\n",
- "[INFO] Processed another 10 blocks, for a total of 300\n",
- "[INFO] Processed another 10 blocks, for a total of 310\n",
- "[INFO] Processed another 10 blocks, for a total of 320\n",
- "[INFO] Processed another 10 blocks, for a total of 330\n",
- "[INFO] Processed another 10 blocks, for a total of 340\n",
- "[INFO] Processed another 10 blocks, for a total of 350\n",
- "[INFO] Processed another 7 blocks, for a total of 357\n",
- "[INFO] Processed another 0 blocks, for a total of 357\n",
- "[INFO] Processed another 0 blocks, for a total of 357\n",
- "[INFO] Processed another 0 blocks, for a total of 357\n",
- "[INFO] Processed another 0 blocks, for a total of 357\n"
+ "[INFO] Processed another 1 items, for a total of 1\n",
+ "[WARN] Possibly page without transcript! https://scrapsfromtheloft.com/2020/05/05/bill-burr-late-show-with-david-letterman-2010/\n",
+ "[INFO] Processed another 1 items, for a total of 2\n",
+ "[INFO] Processed another 1 items, for a total of 3\n",
+ "[INFO] Processed another 1 items, for a total of 4\n",
+ "[INFO] Processed another 1 items, for a total of 5\n",
+ "[INFO] Processed another 1 items, for a total of 6\n",
+ "[INFO] Processed another 1 items, for a total of 7\n",
+ "[INFO] Processed another 1 items, for a total of 8\n",
+ "[INFO] Processed another 1 items, for a total of 9\n",
+ "[INFO] Processed another 1 items, for a total of 10\n",
+ "[INFO] Processed another 1 items, for a total of 11\n",
+ "[INFO] Processed another 1 items, for a total of 12\n",
+ "[INFO] Processed another 1 items, for a total of 13\n"
+ ]
+ },
+ {
+ "ename": "KeyboardInterrupt",
+ "evalue": "",
+ "output_type": "error",
+ "traceback": [
+ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
+ "\u001b[1;31mKeyboardInterrupt\u001b[0m Traceback (most recent call last)",
+ "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m\u001b[0m\n\u001b[0;32m 58\u001b[0m \u001b[1;32mreturn\u001b[0m \u001b[0maccumulator\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 59\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m---> 60\u001b[1;33m \u001b[0mscrap_page\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mscrap_transcript\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0murl_transcripts\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mn_batches\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;36m40\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0msave_to_files\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;32mTrue\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
+ "\u001b[1;32m\u001b[0m in \u001b[0;36mscrap_page\u001b[1;34m(func, urls_gen, n_batches, save_to_files)\u001b[0m\n\u001b[0;32m 44\u001b[0m \u001b[1;32mtry\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 45\u001b[0m \u001b[0mpbar\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mset_description\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m'Loading {} page...'\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mformat\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mi\u001b[0m\u001b[1;33m+\u001b[0m\u001b[1;36m1\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m---> 46\u001b[1;33m \u001b[0mpage\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mrequests\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mget\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0murl\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 47\u001b[0m \u001b[0mpbar\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mset_description\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m'Processing {} page...'\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mformat\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mi\u001b[0m\u001b[1;33m+\u001b[0m\u001b[1;36m1\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 48\u001b[0m \u001b[1;32mexcept\u001b[0m \u001b[0mrequests\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mexceptions\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mConnectionError\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
+ "\u001b[1;32m~\\Anaconda3\\envs\\pytorch\\lib\\site-packages\\requests\\api.py\u001b[0m in \u001b[0;36mget\u001b[1;34m(url, params, **kwargs)\u001b[0m\n\u001b[0;32m 74\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 75\u001b[0m \u001b[0mkwargs\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0msetdefault\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m'allow_redirects'\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;32mTrue\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m---> 76\u001b[1;33m \u001b[1;32mreturn\u001b[0m \u001b[0mrequest\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m'get'\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0murl\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mparams\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mparams\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 77\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 78\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n",
+ "\u001b[1;32m~\\Anaconda3\\envs\\pytorch\\lib\\site-packages\\requests\\api.py\u001b[0m in \u001b[0;36mrequest\u001b[1;34m(method, url, **kwargs)\u001b[0m\n\u001b[0;32m 59\u001b[0m \u001b[1;31m# cases, and look like a memory leak in others.\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 60\u001b[0m \u001b[1;32mwith\u001b[0m \u001b[0msessions\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mSession\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m \u001b[1;32mas\u001b[0m \u001b[0msession\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m---> 61\u001b[1;33m \u001b[1;32mreturn\u001b[0m \u001b[0msession\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mrequest\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mmethod\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mmethod\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0murl\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0murl\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 62\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 63\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n",
+ "\u001b[1;32m~\\Anaconda3\\envs\\pytorch\\lib\\site-packages\\requests\\sessions.py\u001b[0m in \u001b[0;36mrequest\u001b[1;34m(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)\u001b[0m\n\u001b[0;32m 528\u001b[0m }\n\u001b[0;32m 529\u001b[0m \u001b[0msend_kwargs\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mupdate\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0msettings\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 530\u001b[1;33m \u001b[0mresp\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0msend\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mprep\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0msend_kwargs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 531\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 532\u001b[0m \u001b[1;32mreturn\u001b[0m \u001b[0mresp\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
+ "\u001b[1;32m~\\Anaconda3\\envs\\pytorch\\lib\\site-packages\\requests\\sessions.py\u001b[0m in \u001b[0;36msend\u001b[1;34m(self, request, **kwargs)\u001b[0m\n\u001b[0;32m 681\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 682\u001b[0m \u001b[1;32mif\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[0mstream\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 683\u001b[1;33m \u001b[0mr\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mcontent\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 684\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 685\u001b[0m \u001b[1;32mreturn\u001b[0m \u001b[0mr\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
+ "\u001b[1;32m~\\Anaconda3\\envs\\pytorch\\lib\\site-packages\\requests\\models.py\u001b[0m in \u001b[0;36mcontent\u001b[1;34m(self)\u001b[0m\n\u001b[0;32m 827\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_content\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;32mNone\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 828\u001b[0m \u001b[1;32melse\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 829\u001b[1;33m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_content\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;34mb''\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mjoin\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0miter_content\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mCONTENT_CHUNK_SIZE\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m)\u001b[0m \u001b[1;32mor\u001b[0m \u001b[1;34mb''\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 830\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 831\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_content_consumed\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;32mTrue\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
+ "\u001b[1;32m~\\Anaconda3\\envs\\pytorch\\lib\\site-packages\\requests\\models.py\u001b[0m in \u001b[0;36mgenerate\u001b[1;34m()\u001b[0m\n\u001b[0;32m 749\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mhasattr\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mraw\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;34m'stream'\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 750\u001b[0m \u001b[1;32mtry\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 751\u001b[1;33m \u001b[1;32mfor\u001b[0m \u001b[0mchunk\u001b[0m \u001b[1;32min\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mraw\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mstream\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mchunk_size\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mdecode_content\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;32mTrue\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 752\u001b[0m \u001b[1;32myield\u001b[0m \u001b[0mchunk\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 753\u001b[0m \u001b[1;32mexcept\u001b[0m \u001b[0mProtocolError\u001b[0m \u001b[1;32mas\u001b[0m \u001b[0me\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
+ "\u001b[1;32m~\\Anaconda3\\envs\\pytorch\\lib\\site-packages\\urllib3\\response.py\u001b[0m in \u001b[0;36mstream\u001b[1;34m(self, amt, decode_content)\u001b[0m\n\u001b[0;32m 558\u001b[0m \"\"\"\n\u001b[0;32m 559\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mchunked\u001b[0m \u001b[1;32mand\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0msupports_chunked_reads\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 560\u001b[1;33m \u001b[1;32mfor\u001b[0m \u001b[0mline\u001b[0m \u001b[1;32min\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mread_chunked\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mamt\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mdecode_content\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mdecode_content\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 561\u001b[0m \u001b[1;32myield\u001b[0m \u001b[0mline\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 562\u001b[0m \u001b[1;32melse\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
+ "\u001b[1;32m~\\Anaconda3\\envs\\pytorch\\lib\\site-packages\\urllib3\\response.py\u001b[0m in \u001b[0;36mread_chunked\u001b[1;34m(self, amt, decode_content)\u001b[0m\n\u001b[0;32m 753\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mchunk_left\u001b[0m \u001b[1;33m==\u001b[0m \u001b[1;36m0\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 754\u001b[0m \u001b[1;32mbreak\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 755\u001b[1;33m \u001b[0mchunk\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_handle_chunk\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mamt\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 756\u001b[0m decoded = self._decode(\n\u001b[0;32m 757\u001b[0m \u001b[0mchunk\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mdecode_content\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mdecode_content\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mflush_decoder\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;32mFalse\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
+ "\u001b[1;32m~\\Anaconda3\\envs\\pytorch\\lib\\site-packages\\urllib3\\response.py\u001b[0m in \u001b[0;36m_handle_chunk\u001b[1;34m(self, amt)\u001b[0m\n\u001b[0;32m 697\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mchunk_left\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;32mNone\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 698\u001b[0m \u001b[1;32melif\u001b[0m \u001b[0mamt\u001b[0m \u001b[1;33m<\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mchunk_left\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 699\u001b[1;33m \u001b[0mvalue\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_fp\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_safe_read\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mamt\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 700\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mchunk_left\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mchunk_left\u001b[0m \u001b[1;33m-\u001b[0m \u001b[0mamt\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 701\u001b[0m \u001b[0mreturned_chunk\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mvalue\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
+ "\u001b[1;32m~\\Anaconda3\\envs\\pytorch\\lib\\http\\client.py\u001b[0m in \u001b[0;36m_safe_read\u001b[1;34m(self, amt)\u001b[0m\n\u001b[0;32m 618\u001b[0m \u001b[0ms\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;33m[\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 619\u001b[0m \u001b[1;32mwhile\u001b[0m \u001b[0mamt\u001b[0m \u001b[1;33m>\u001b[0m \u001b[1;36m0\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 620\u001b[1;33m \u001b[0mchunk\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mfp\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mread\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mmin\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mamt\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mMAXAMOUNT\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 621\u001b[0m \u001b[1;32mif\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[0mchunk\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 622\u001b[0m \u001b[1;32mraise\u001b[0m \u001b[0mIncompleteRead\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34mb''\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mjoin\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0ms\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mamt\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
+ "\u001b[1;32m~\\Anaconda3\\envs\\pytorch\\lib\\socket.py\u001b[0m in \u001b[0;36mreadinto\u001b[1;34m(self, b)\u001b[0m\n\u001b[0;32m 587\u001b[0m \u001b[1;32mwhile\u001b[0m \u001b[1;32mTrue\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 588\u001b[0m \u001b[1;32mtry\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 589\u001b[1;33m \u001b[1;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_sock\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mrecv_into\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mb\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 590\u001b[0m \u001b[1;32mexcept\u001b[0m \u001b[0mtimeout\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 591\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_timeout_occurred\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;32mTrue\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
+ "\u001b[1;32m~\\Anaconda3\\envs\\pytorch\\lib\\site-packages\\urllib3\\contrib\\pyopenssl.py\u001b[0m in \u001b[0;36mrecv_into\u001b[1;34m(self, *args, **kwargs)\u001b[0m\n\u001b[0;32m 311\u001b[0m \u001b[1;32mdef\u001b[0m \u001b[0mrecv_into\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m*\u001b[0m\u001b[0margs\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 312\u001b[0m \u001b[1;32mtry\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 313\u001b[1;33m \u001b[1;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mconnection\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mrecv_into\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m*\u001b[0m\u001b[0margs\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 314\u001b[0m \u001b[1;32mexcept\u001b[0m \u001b[0mOpenSSL\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mSSL\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mSysCallError\u001b[0m \u001b[1;32mas\u001b[0m \u001b[0me\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 315\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0msuppress_ragged_eofs\u001b[0m \u001b[1;32mand\u001b[0m \u001b[0me\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0margs\u001b[0m \u001b[1;33m==\u001b[0m \u001b[1;33m(\u001b[0m\u001b[1;33m-\u001b[0m\u001b[1;36m1\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;34m\"Unexpected EOF\"\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
+ "\u001b[1;32m~\\Anaconda3\\envs\\pytorch\\lib\\site-packages\\OpenSSL\\SSL.py\u001b[0m in \u001b[0;36mrecv_into\u001b[1;34m(self, buffer, nbytes, flags)\u001b[0m\n\u001b[0;32m 1837\u001b[0m \u001b[0mresult\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0m_lib\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mSSL_peek\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_ssl\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mbuf\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mnbytes\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1838\u001b[0m \u001b[1;32melse\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 1839\u001b[1;33m \u001b[0mresult\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0m_lib\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mSSL_read\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_ssl\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mbuf\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mnbytes\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 1840\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_raise_ssl_error\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_ssl\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mresult\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1841\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n",
+ "\u001b[1;31mKeyboardInterrupt\u001b[0m: "
]
}
],
"source": [
- "URL = 'https://scrapsfromtheloft.com/comedy/page/{}/'\n",
- "\n",
- "def save_file(path, txt, encoding=None):\n",
- " # Create the corresponding folder (if needed)\n",
- " os.makedirs(os.path.dirname(path), exist_ok=True)\n",
- " with open(path, 'w', encoding=encoding) as out_file:\n",
- " out_file.writelines(txt)\n",
- "\n",
- "\n",
- "def process_block(block):\n",
- " result = []\n",
- " # If not text, skip\n",
- " try:\n",
- " if block.name in ['img'] or (block.name == 'div' and block.get('class', [''])[0] == 'yarpp-related'):\n",
- " return result\n",
- " except AttributeError:\n",
- " print('[ERROR] AttributeError!')\n",
- " print(type(block))\n",
- " print(block)\n",
- " print('---------------------------------')\n",
- " # If is a tag, process it's content\n",
- " if isinstance(block, Tag):\n",
- " for sub_block in block.contents:\n",
- " result.extend(process_block(sub_block))\n",
- " return result\n",
- " for regexp, sub_str in regexps:\n",
- " block = regexp.sub(sub_str, block)\n",
- " block = block.strip()\n",
- " if block:\n",
- " result.append(block)\n",
- " return result\n",
- "\n",
- "\n",
- "def scrap_transcript(url, file_path):\n",
- " try:\n",
- " transcript_page = requests.get(url)\n",
- " except requests.exceptions.ConnectionError:\n",
- " print('[ERROR] Connection error to', url)\n",
- " return\n",
- " transcript_soup = BeautifulSoup(transcript_page.content, 'html.parser')\n",
- " content_blocks = transcript_soup.findAll('div', 'post-content')\n",
+ "def scrap_transcript(soup, transcript_url):\n",
+ " file_name = transcript_url[:-1].rsplit('/', 1)[-1]\n",
+ " file_path = os.path.join('test', file_name + '.txt')\n",
+    "    if 'transcript' not in file_path.lower():\n",
+    "        print('[WARN] Possibly a page without a transcript!', transcript_url)\n",
+ " content_blocks = soup.findAll('div', 'post-content')\n",
" if len(content_blocks) != 1:\n",
-    "        print('[WARN] strange content in', url)\n",
+    "        print('[WARN] strange content in', transcript_url)\n",
" return\n",
" content = process_block(content_blocks[0])\n",
+    "    # Merge small paragraphs.\n",
" stripped_content = ['']\n",
" for line in content:\n",
" if len(stripped_content[-1]) < 200:\n",
" stripped_content[-1] += ' ' + line\n",
" else:\n",
" stripped_content.append(line)\n",
- " save_file(file_path, '\\n'.join(stripped_content), encoding='utf8')\n",
+ " stripped_content = '\\n'.join(stripped_content)\n",
+ " if file_path:\n",
+ " save_file(file_path, stripped_content, encoding='utf8')\n",
+ " return [stripped_content]\n",
+ "\n",
+ "def url_transcripts(n_batches, pbar):\n",
+ " URL = 'https://scrapsfromtheloft.com/comedy/page/{}/'\n",
+ " for i in range(n_batches):\n",
+ " # Load block of pages\n",
+ " pbar.set_description('Loading {} block...'.format(i+1))\n",
+    "        # Pages on the site are 1-indexed (the pbar label above also counts from 1)\n",
+    "        page = requests.get(URL.format(i + 1))\n",
+ " soup = BeautifulSoup(page.content, 'html.parser')\n",
+ " blocks = soup.body.findAll('div', 'fusion-post-content post-content')\n",
+ " # Extract link to the page\n",
+ " for j, block in enumerate(blocks):\n",
+ " block_title = block.find('h2', 'entry-title fusion-post-title').a\n",
+ " yield block_title['href']\n",
+ "\n",
+ "scrap_page(scrap_transcript, url_transcripts, n_batches=40, save_to_files=True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Russian\n",
"\n",
+ "### http://anecdotica.ru/"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "b4905dc2a6224f09ac023874ef2c70a3",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "HBox(children=(FloatProgress(value=0.0, max=1000.0), HTML(value='')))"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[INFO] Processed another 25 items, for a total of 25\n",
+ "[INFO] Processed another 23 items, for a total of 3682\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[INFO] Processed another 25 items, for a total of 3707\n",
+ "[INFO] Processed another 25 items, for a total of 7344\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[INFO] Processed another 25 items, for a total of 7369\n",
+ "[INFO] Processed another 25 items, for a total of 10981\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[INFO] Processed another 25 items, for a total of 11006\n",
+ "[INFO] Processed another 25 items, for a total of 12025\n",
+ "[INFO] Processed another 25 items, for a total of 12050\n",
+ "[INFO] Processed another 25 items, for a total of 12075\n",
+ "[INFO] Processed another 25 items, for a total of 12100\n",
+ "[INFO] Processed another 25 items, for a total of 12125\n",
+ "[INFO] Processed another 25 items, for a total of 12150\n",
+ "[INFO] Processed another 25 items, for a total of 12175\n",
+ "[INFO] Processed another 25 items, for a total of 12200\n",
+ "[INFO] Processed another 25 items, for a total of 12225\n",
+ "[INFO] Processed another 25 items, for a total of 12250\n",
+ "[INFO] Processed another 25 items, for a total of 12275\n",
+ "[INFO] Processed another 25 items, for a total of 12300\n",
+ "[INFO] Processed another 25 items, for a total of 12325\n",
+ "[INFO] Processed another 25 items, for a total of 12350\n",
+ "[INFO] Processed another 25 items, for a total of 12375\n",
+ "[INFO] Processed another 25 items, for a total of 12400\n",
+ "[INFO] Processed another 25 items, for a total of 12425\n",
+ "[INFO] Processed another 25 items, for a total of 12450\n",
+ "[INFO] Processed another 25 items, for a total of 12475\n",
+ "[INFO] Processed another 25 items, for a total of 12500\n",
+ "[INFO] Processed another 25 items, for a total of 12525\n",
+ "[INFO] Processed another 25 items, for a total of 12550\n",
+ "[INFO] Processed another 25 items, for a total of 12575\n",
+ "[INFO] Processed another 25 items, for a total of 12600\n",
+ "[INFO] Processed another 25 items, for a total of 12625\n",
+ "[INFO] Processed another 25 items, for a total of 12650\n",
+ "[INFO] Processed another 25 items, for a total of 12675\n",
+ "[INFO] Processed another 25 items, for a total of 12700\n",
+ "[INFO] Processed another 25 items, for a total of 12725\n",
+ "[INFO] Processed another 25 items, for a total of 12750\n",
+ "[INFO] Processed another 25 items, for a total of 12775\n",
+ "[INFO] Processed another 25 items, for a total of 12800\n",
+ "[INFO] Processed another 25 items, for a total of 12825\n",
+ "[INFO] Processed another 25 items, for a total of 12850\n",
+ "[INFO] Processed another 25 items, for a total of 12875\n",
+ "[INFO] Processed another 25 items, for a total of 12900\n",
+ "[INFO] Processed another 25 items, for a total of 12925\n",
+ "[INFO] Processed another 25 items, for a total of 12950\n",
+ "[INFO] Processed another 25 items, for a total of 12975\n",
+ "[INFO] Processed another 25 items, for a total of 13000\n",
+ "[INFO] Processed another 25 items, for a total of 13025\n",
+ "[INFO] Processed another 25 items, for a total of 13050\n",
+ "[INFO] Processed another 25 items, for a total of 13075\n",
+ "[INFO] Processed another 25 items, for a total of 13100\n",
+ "[INFO] Processed another 25 items, for a total of 13125\n",
+ "[INFO] Processed another 25 items, for a total of 13150\n",
+ "[INFO] Processed another 25 items, for a total of 13175\n",
+ "[INFO] Processed another 25 items, for a total of 13200\n",
+ "[INFO] Processed another 25 items, for a total of 13225\n",
+ "[INFO] Processed another 25 items, for a total of 13250\n",
+ "[INFO] Processed another 25 items, for a total of 13275\n",
+ "[INFO] Processed another 25 items, for a total of 13300\n",
+ "[INFO] Processed another 25 items, for a total of 13325\n",
+ "[INFO] Processed another 25 items, for a total of 13350\n",
+ "[INFO] Processed another 25 items, for a total of 13375\n",
+ "[INFO] Processed another 25 items, for a total of 13400\n",
+ "[INFO] Processed another 25 items, for a total of 13425\n",
+ "[INFO] Processed another 25 items, for a total of 13450\n",
+ "[INFO] Processed another 25 items, for a total of 13475\n",
+ "[INFO] Processed another 25 items, for a total of 13500\n",
+ "[INFO] Processed another 25 items, for a total of 13525\n",
+ "[INFO] Processed another 25 items, for a total of 13550\n",
+ "[INFO] Processed another 25 items, for a total of 13575\n",
+ "[INFO] Processed another 25 items, for a total of 13600\n",
+ "[INFO] Processed another 25 items, for a total of 13625\n",
+ "[INFO] Processed another 25 items, for a total of 13650\n",
+ "[INFO] Processed another 25 items, for a total of 13675\n",
+ "[INFO] Processed another 25 items, for a total of 13700\n",
+ "[INFO] Processed another 25 items, for a total of 13725\n",
+ "[INFO] Processed another 25 items, for a total of 13750\n",
+ "[INFO] Processed another 25 items, for a total of 13775\n",
+ "[INFO] Processed another 25 items, for a total of 13800\n",
+ "[INFO] Processed another 25 items, for a total of 13825\n",
+ "[INFO] Processed another 24 items, for a total of 13849\n",
+ "[INFO] Processed another 25 items, for a total of 13874\n",
+ "[INFO] Processed another 25 items, for a total of 13899\n",
+ "[INFO] Processed another 24 items, for a total of 13923\n",
+ "[INFO] Processed another 25 items, for a total of 13948\n",
+ "[INFO] Processed another 25 items, for a total of 13973\n",
+ "[INFO] Processed another 25 items, for a total of 13998\n",
+ "[INFO] Processed another 25 items, for a total of 14023\n",
+ "[INFO] Processed another 25 items, for a total of 14048\n",
+ "[INFO] Processed another 25 items, for a total of 14073\n",
+ "[INFO] Processed another 25 items, for a total of 14098\n",
+ "[INFO] Processed another 25 items, for a total of 14123\n",
+ "[INFO] Processed another 25 items, for a total of 14148\n",
+ "[INFO] Processed another 25 items, for a total of 14173\n",
+ "[INFO] Processed another 25 items, for a total of 14198\n",
+ "[INFO] Processed another 25 items, for a total of 14223\n",
+ "[INFO] Processed another 25 items, for a total of 14248\n",
+ "[INFO] Processed another 25 items, for a total of 14273\n",
+ "[INFO] Processed another 25 items, for a total of 14298\n",
+ "[INFO] Processed another 25 items, for a total of 14323\n",
+ "[INFO] Processed another 25 items, for a total of 14348\n",
+ "[INFO] Processed another 25 items, for a total of 14373\n",
+ "[INFO] Processed another 25 items, for a total of 14398\n",
+ "[INFO] Processed another 25 items, for a total of 14423\n",
+ "[INFO] Processed another 25 items, for a total of 14448\n",
+ "[INFO] Processed another 25 items, for a total of 14473\n",
+ "[INFO] Processed another 25 items, for a total of 14498\n",
+ "[INFO] Processed another 25 items, for a total of 14523\n",
+ "[INFO] Processed another 25 items, for a total of 14548\n",
+ "[INFO] Processed another 24 items, for a total of 14572\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[INFO] Processed another 25 items, for a total of 14597\n",
+ "[... repeated progress lines omitted ...]\n",
+ "[INFO] Processed another 25 items, for a total of 18153\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[INFO] Processed another 24 items, for a total of 18177\n",
+ "[... repeated progress lines omitted ...]\n",
+ "[INFO] Processed another 2 items, for a total of 20025\n",
+ "[INFO] Processed another 0 items, for a total of 20025\n",
+ "[... previous line repeated many times ...]\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[INFO] Processed another 0 items, for a total of 20025\n",
+ "[... previous line repeated many times ...]\n",
+ "[INFO] Processed another 0 items, for a total of 20025\n",
+ "[INFO] Processed another 0 items, for a total of 20025\n",
+ "[INFO] Processed another 0 items, for a total of 20025\n",
+ "[INFO] Processed another 0 items, for a total of 20025\n",
+ "[INFO] Processed another 0 items, for a total of 20025\n",
+ "[INFO] Processed another 0 items, for a total of 20025\n",
+ "[INFO] Processed another 0 items, for a total of 20025\n",
+ "[INFO] Processed another 0 items, for a total of 20025\n",
+ "[INFO] Processed another 0 items, for a total of 20025\n"
+ ]
+ }
+ ],
+ "source": [
+ "def scrap_anecdotica(soup, _):\n",
+ " jokes = soup.findAll('div', 'item_text')\n",
+ " return [' '.join(process_block(joke)) for joke in jokes]\n",
+ "\n",
+ "def urls_anecdotica(n_batches, _):\n",
+ " URL = 'http://anecdotica.ru/all/{}'\n",
+ " for i in range(1, n_batches + 1):\n",
+ " yield URL.format(i)\n",
+ "\n",
+ "anecdotica = scrap_page(scrap_anecdotica, urls_anecdotica, n_batches=1000)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# pd.DataFrame(anecdotica, columns=['Text']).to_csv('../data/anecdotika.csv')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### https://www.anekdot.ru/"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 44,
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "958487d9ce9b49d6ac5718380b711826",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "HBox(children=(FloatProgress(value=0.0, max=1248000.0), HTML(value='')))"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9501/x950133;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9502/x950233;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9503/x950333;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9504/x950433;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9505/x950533;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9506/x950633;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9507/x950733;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9508/x950833;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9509/x950933;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9510/x951033;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9511/x951133;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9512/x951233;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9601/x960133;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9602/x960233;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9603/x960333;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9604/x960433;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9605/x960533;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9606/x960633;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9607/x960733;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9608/x960833;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9609/x960933;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9610/x961033;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9611/x961133;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9612/x961233;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9701/x970133;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9702/x970233;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9703/x970333;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9704/x970433;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9705/x970533;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9706/x970633;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9707/x970733;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9708/x970833;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9709/x970933;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9710/x971033;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9711/x971133;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9712/x971233;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9801/x980133;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9802/x980233;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9803/x980333;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9804/x980433;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9805/x980533;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9806/x980633;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9807/x980733;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9808/x980833;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9809/x980933;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9810/x981033;0,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 101\n",
+ "[INFO] Processed another 101 items, for a total of 202\n",
+ "[INFO] Processed another 68 items, for a total of 270\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9811/x981133;300,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 371\n",
+ "[INFO] Processed another 101 items, for a total of 472\n",
+ "[INFO] Processed another 85 items, for a total of 557\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9812/x981233;300,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 658\n",
+ "[INFO] Processed another 101 items, for a total of 759\n",
+ "[INFO] Processed another 101 items, for a total of 860\n",
+ "[INFO] Processed another 45 items, for a total of 905\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9901/x990133;400,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 1006\n",
+ "[INFO] Processed another 101 items, for a total of 1107\n",
+ "[INFO] Processed another 101 items, for a total of 1208\n",
+ "[INFO] Processed another 47 items, for a total of 1255\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9902/x990233;400,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 1356\n",
+ "[INFO] Processed another 101 items, for a total of 1457\n",
+ "[INFO] Processed another 101 items, for a total of 1558\n",
+ "[INFO] Processed another 62 items, for a total of 1620\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9903/x990333;400,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 1721\n",
+ "[INFO] Processed another 101 items, for a total of 1822\n",
+ "[INFO] Processed another 101 items, for a total of 1923\n",
+ "[INFO] Processed another 101 items, for a total of 2024\n",
+ "[INFO] Processed another 10 items, for a total of 2034\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9904/x990433;500,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 2135\n",
+ "[INFO] Processed another 101 items, for a total of 2236\n",
+ "[INFO] Processed another 101 items, for a total of 2337\n",
+ "[INFO] Processed another 92 items, for a total of 2429\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9905/x990533;400,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 2530\n",
+ "[INFO] Processed another 101 items, for a total of 2631\n",
+ "[INFO] Processed another 92 items, for a total of 2723\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9906/x990633;300,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 2824\n",
+ "[INFO] Processed another 101 items, for a total of 2925\n",
+ "[INFO] Processed another 101 items, for a total of 3026\n",
+ "[INFO] Processed another 69 items, for a total of 3095\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9907/x990733;400,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 3196\n",
+ "[INFO] Processed another 101 items, for a total of 3297\n",
+ "[INFO] Processed another 101 items, for a total of 3398\n",
+ "[INFO] Processed another 4 items, for a total of 3402\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9908/x990833;400,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 3503\n",
+ "[INFO] Processed another 101 items, for a total of 3604\n",
+ "[INFO] Processed another 69 items, for a total of 3673\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9909/x990933;300,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 3774\n",
+ "[INFO] Processed another 101 items, for a total of 3875\n",
+ "[INFO] Processed another 101 items, for a total of 3976\n",
+ "[INFO] Processed another 101 items, for a total of 4077\n",
+ "[INFO] Processed another 33 items, for a total of 4110\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9910/x991033;500,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 4211\n",
+ "[INFO] Processed another 101 items, for a total of 4312\n",
+ "[INFO] Processed another 101 items, for a total of 4413\n",
+ "[INFO] Processed another 101 items, for a total of 4514\n",
+ "[INFO] Processed another 19 items, for a total of 4533\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9911/x991133;500,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 4634\n",
+ "[INFO] Processed another 101 items, for a total of 4735\n",
+ "[INFO] Processed another 101 items, for a total of 4836\n",
+ "[INFO] Processed another 70 items, for a total of 4906\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an9912/x991233;400,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 5007\n",
+ "[INFO] Processed another 101 items, for a total of 5108\n",
+ "[INFO] Processed another 101 items, for a total of 5209\n",
+ "[INFO] Processed another 101 items, for a total of 5310\n",
+ "[INFO] Processed another 101 items, for a total of 5411\n",
+ "[INFO] Processed another 54 items, for a total of 5465\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0001/x000133;600,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 5566\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[INFO] Processed another 101 items, for a total of 5667\n",
+ "[INFO] Processed another 101 items, for a total of 5768\n",
+ "[INFO] Processed another 101 items, for a total of 5869\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0002/x000233;400,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 5970\n",
+ "[INFO] Processed another 101 items, for a total of 6071\n",
+ "[INFO] Processed another 101 items, for a total of 6172\n",
+ "[INFO] Processed another 86 items, for a total of 6258\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0003/x000333;400,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 6359\n",
+ "[INFO] Processed another 101 items, for a total of 6460\n",
+ "[INFO] Processed another 101 items, for a total of 6561\n",
+ "[INFO] Processed another 101 items, for a total of 6662\n",
+ "[INFO] Processed another 6 items, for a total of 6668\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0004/x000433;500,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 6769\n",
+ "[INFO] Processed another 101 items, for a total of 6870\n",
+ "[INFO] Processed another 98 items, for a total of 6968\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0005/x000533;300,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 7069\n",
+ "[INFO] Processed another 101 items, for a total of 7170\n",
+ "[INFO] Processed another 101 items, for a total of 7271\n",
+ "[INFO] Processed another 55 items, for a total of 7326\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0006/x000633;400,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 7427\n",
+ "[INFO] Processed another 101 items, for a total of 7528\n",
+ "[INFO] Processed another 101 items, for a total of 7629\n",
+ "[INFO] Processed another 39 items, for a total of 7668\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0007/x000733;400,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 7769\n",
+ "[INFO] Processed another 101 items, for a total of 7870\n",
+ "[INFO] Processed another 101 items, for a total of 7971\n",
+ "[INFO] Processed another 39 items, for a total of 8010\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0008/x000833;400,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 8111\n",
+ "[INFO] Processed another 101 items, for a total of 8212\n",
+ "[INFO] Processed another 101 items, for a total of 8313\n",
+ "[INFO] Processed another 76 items, for a total of 8389\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0009/x000933;400,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 8490\n",
+ "[INFO] Processed another 101 items, for a total of 8591\n",
+ "[INFO] Processed another 101 items, for a total of 8692\n",
+ "[INFO] Processed another 101 items, for a total of 8793\n",
+ "[INFO] Processed another 27 items, for a total of 8820\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0010/x001033;500,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 8921\n",
+ "[INFO] Processed another 101 items, for a total of 9022\n",
+ "[INFO] Processed another 101 items, for a total of 9123\n",
+ "[INFO] Processed another 85 items, for a total of 9208\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0011/x001133;400,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 9309\n",
+ "[INFO] Processed another 101 items, for a total of 9410\n",
+ "[INFO] Processed another 101 items, for a total of 9511\n",
+ "[INFO] Processed another 101 items, for a total of 9612\n",
+ "[INFO] Processed another 75 items, for a total of 9687\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0012/x001233;500,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 9788\n",
+ "[INFO] Processed another 101 items, for a total of 9889\n",
+ "[INFO] Processed another 101 items, for a total of 9990\n",
+ "[INFO] Processed another 32 items, for a total of 10022\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0101/x010133;400,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 10123\n",
+ "[INFO] Processed another 101 items, for a total of 10224\n",
+ "[INFO] Processed another 101 items, for a total of 10325\n",
+ "[INFO] Processed another 101 items, for a total of 10426\n",
+ "[INFO] Processed another 5 items, for a total of 10431\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0102/x010233;500,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 10532\n",
+ "[INFO] Processed another 101 items, for a total of 10633\n",
+ "[INFO] Processed another 101 items, for a total of 10734\n",
+ "[INFO] Processed another 101 items, for a total of 10835\n",
+ "[INFO] Processed another 101 items, for a total of 10936\n",
+ "[INFO] Processed another 7 items, for a total of 10943\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0103/x010333;600,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 11044\n",
+ "[INFO] Processed another 101 items, for a total of 11145\n",
+ "[INFO] Processed another 101 items, for a total of 11246\n",
+ "[INFO] Processed another 101 items, for a total of 11347\n",
+ "[INFO] Processed another 101 items, for a total of 11448\n",
+ "[INFO] Processed another 101 items, for a total of 11549\n",
+ "[INFO] Processed another 101 items, for a total of 11650\n",
+ "[INFO] Processed another 93 items, for a total of 11743\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0104/x010433;800,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 11844\n",
+ "[INFO] Processed another 101 items, for a total of 11945\n",
+ "[INFO] Processed another 101 items, for a total of 12046\n",
+ "[INFO] Processed another 101 items, for a total of 12147\n",
+ "[INFO] Processed another 101 items, for a total of 12248\n",
+ "[INFO] Processed another 70 items, for a total of 12318\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0105/x010533;600,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 12419\n",
+ "[INFO] Processed another 101 items, for a total of 12520\n",
+ "[INFO] Processed another 101 items, for a total of 12621\n",
+ "[INFO] Processed another 101 items, for a total of 12722\n",
+ "[INFO] Processed another 8 items, for a total of 12730\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0106/x010633;500,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 12831\n",
+ "[INFO] Processed another 101 items, for a total of 12932\n",
+ "[INFO] Processed another 101 items, for a total of 13033\n",
+ "[INFO] Processed another 101 items, for a total of 13134\n",
+ "[INFO] Processed another 101 items, for a total of 13235\n",
+ "[INFO] Processed another 34 items, for a total of 13269\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0107/x010733;600,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 13370\n",
+ "[INFO] Processed another 101 items, for a total of 13471\n",
+ "[INFO] Processed another 101 items, for a total of 13572\n",
+ "[INFO] Processed another 101 items, for a total of 13673\n",
+ "[INFO] Processed another 101 items, for a total of 13774\n",
+ "[INFO] Processed another 67 items, for a total of 13841\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0108/x010833;600,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 13942\n",
+ "[INFO] Processed another 101 items, for a total of 14043\n",
+ "[INFO] Processed another 101 items, for a total of 14144\n",
+ "[INFO] Processed another 101 items, for a total of 14245\n",
+ "[INFO] Processed another 101 items, for a total of 14346\n",
+ "[INFO] Processed another 95 items, for a total of 14441\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0109/x010933;600,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 14542\n",
+ "[INFO] Processed another 101 items, for a total of 14643\n",
+ "[INFO] Processed another 101 items, for a total of 14744\n",
+ "[INFO] Processed another 101 items, for a total of 14845\n",
+ "[INFO] Processed another 101 items, for a total of 14946\n",
+ "[INFO] Processed another 64 items, for a total of 15010\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0110/x011033;600,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 15111\n",
+ "[INFO] Processed another 101 items, for a total of 15212\n",
+ "[INFO] Processed another 101 items, for a total of 15313\n",
+ "[INFO] Processed another 101 items, for a total of 15414\n",
+ "[INFO] Processed another 101 items, for a total of 15515\n",
+ "[INFO] Processed another 101 items, for a total of 15616\n",
+ "[INFO] Processed another 16 items, for a total of 15632\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0111/x011133;700,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 15733\n",
+ "[INFO] Processed another 101 items, for a total of 15834\n",
+ "[INFO] Processed another 101 items, for a total of 15935\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[INFO] Processed another 101 items, for a total of 16036\n",
+ "[INFO] Processed another 99 items, for a total of 16135\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0112/x011233;500,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0201/x020133;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0202/x020233;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0203/x020333;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0204/x020433;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0205/x020533;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0206/x020633;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0207/x020733;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0208/x020833;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0209/x020933;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0210/x021033;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0211/x021133;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0212/x021233;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0301/x030133;0,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 16236\n",
+ "[INFO] Processed another 101 items, for a total of 16337\n",
+ "[INFO] Processed another 101 items, for a total of 16438\n",
+ "[INFO] Processed another 101 items, for a total of 16539\n",
+ "[INFO] Processed another 101 items, for a total of 16640\n",
+ "[INFO] Processed another 43 items, for a total of 16683\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0302/x030233;600,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 16784\n",
+ "[INFO] Processed another 101 items, for a total of 16885\n",
+ "[INFO] Processed another 101 items, for a total of 16986\n",
+ "[INFO] Processed another 101 items, for a total of 17087\n",
+ "[INFO] Processed another 18 items, for a total of 17105\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0303/x030333;500,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 17206\n",
+ "[INFO] Processed another 101 items, for a total of 17307\n",
+ "[INFO] Processed another 101 items, for a total of 17408\n",
+ "[INFO] Processed another 101 items, for a total of 17509\n",
+ "[INFO] Processed another 43 items, for a total of 17552\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0304/x030433;500,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 17653\n",
+ "[INFO] Processed another 101 items, for a total of 17754\n",
+ "[INFO] Processed another 101 items, for a total of 17855\n",
+ "[INFO] Processed another 101 items, for a total of 17956\n",
+ "[INFO] Processed another 42 items, for a total of 17998\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0305/x030533;500,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 18099\n",
+ "[INFO] Processed another 101 items, for a total of 18200\n",
+ "[INFO] Processed another 101 items, for a total of 18301\n",
+ "[INFO] Processed another 101 items, for a total of 18402\n",
+ "[INFO] Processed another 73 items, for a total of 18475\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0306/x030633;500,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 18576\n",
+ "[INFO] Processed another 101 items, for a total of 18677\n",
+ "[INFO] Processed another 101 items, for a total of 18778\n",
+ "[INFO] Processed another 101 items, for a total of 18879\n",
+ "[INFO] Processed another 101 items, for a total of 18980\n",
+ "[INFO] Processed another 40 items, for a total of 19020\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0307/x030733;600,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 19121\n",
+ "[INFO] Processed another 101 items, for a total of 19222\n",
+ "[INFO] Processed another 101 items, for a total of 19323\n",
+ "[INFO] Processed another 36 items, for a total of 19359\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0308/x030833;400,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 19460\n",
+ "[INFO] Processed another 101 items, for a total of 19561\n",
+ "[INFO] Processed another 101 items, for a total of 19662\n",
+ "[INFO] Processed another 101 items, for a total of 19763\n",
+ "[INFO] Processed another 66 items, for a total of 19829\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0309/x030933;500,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 19930\n",
+ "[INFO] Processed another 101 items, for a total of 20031\n",
+ "[INFO] Processed another 101 items, for a total of 20132\n",
+ "[INFO] Processed another 101 items, for a total of 20233\n",
+ "[INFO] Processed another 101 items, for a total of 20334\n",
+ "[INFO] Processed another 90 items, for a total of 20424\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0310/x031033;600,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 20525\n",
+ "[INFO] Processed another 101 items, for a total of 20626\n",
+ "[INFO] Processed another 101 items, for a total of 20727\n",
+ "[INFO] Processed another 101 items, for a total of 20828\n",
+ "[INFO] Processed another 101 items, for a total of 20929\n",
+ "[INFO] Processed another 101 items, for a total of 21030\n",
+ "[INFO] Processed another 54 items, for a total of 21084\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0311/x031133;700,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 21185\n",
+ "[INFO] Processed another 101 items, for a total of 21286\n",
+ "[INFO] Processed another 101 items, for a total of 21387\n",
+ "[INFO] Processed another 101 items, for a total of 21488\n",
+ "[INFO] Processed another 101 items, for a total of 21589\n",
+ "[INFO] Processed another 50 items, for a total of 21639\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0312/x031233;600,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 21740\n",
+ "[INFO] Processed another 101 items, for a total of 21841\n",
+ "[INFO] Processed another 101 items, for a total of 21942\n",
+ "[INFO] Processed another 101 items, for a total of 22043\n",
+ "[INFO] Processed another 101 items, for a total of 22144\n",
+ "[INFO] Processed another 100 items, for a total of 22244\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0401/x040133;600,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 22345\n",
+ "[INFO] Processed another 101 items, for a total of 22446\n",
+ "[INFO] Processed another 101 items, for a total of 22547\n",
+ "[INFO] Processed another 101 items, for a total of 22648\n",
+ "[INFO] Processed another 101 items, for a total of 22749\n",
+ "[INFO] Processed another 101 items, for a total of 22850\n",
+ "[INFO] Processed another 70 items, for a total of 22920\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0402/x040233;700,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 23021\n",
+ "[INFO] Processed another 101 items, for a total of 23122\n",
+ "[INFO] Processed another 101 items, for a total of 23223\n",
+ "[INFO] Processed another 101 items, for a total of 23324\n",
+ "[INFO] Processed another 101 items, for a total of 23425\n",
+ "[INFO] Processed another 101 items, for a total of 23526\n",
+ "[INFO] Processed another 101 items, for a total of 23627\n",
+ "[INFO] Processed another 31 items, for a total of 23658\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0403/x040333;800,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 23759\n",
+ "[INFO] Processed another 101 items, for a total of 23860\n",
+ "[INFO] Processed another 101 items, for a total of 23961\n",
+ "[INFO] Processed another 101 items, for a total of 24062\n",
+ "[INFO] Processed another 101 items, for a total of 24163\n",
+ "[INFO] Processed another 84 items, for a total of 24247\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0404/x040433;600,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 24348\n",
+ "[INFO] Processed another 101 items, for a total of 24449\n",
+ "[INFO] Processed another 101 items, for a total of 24550\n",
+ "[INFO] Processed another 101 items, for a total of 24651\n",
+ "[INFO] Processed another 101 items, for a total of 24752\n",
+ "[INFO] Processed another 101 items, for a total of 24853\n",
+ "[INFO] Processed another 49 items, for a total of 24902\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0405/x040533;700,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 25003\n",
+ "[INFO] Processed another 101 items, for a total of 25104\n",
+ "[INFO] Processed another 101 items, for a total of 25205\n",
+ "[INFO] Processed another 101 items, for a total of 25306\n",
+ "[INFO] Processed another 101 items, for a total of 25407\n",
+ "[INFO] Processed another 101 items, for a total of 25508\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+    "[INFO] Processed another 101 items, for a total of 25609\n",
+    "[INFO] … (repetitive progress lines trimmed: archive pages an0406-an0603 scraped; months an0504-an0508 returned no jokes) …\n",
+    "[INFO] Processed another 101 items, for a total of 36231\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+    "[INFO] Processed another 101 items, for a total of 36332\n",
+    "[INFO] … (repetitive progress lines trimmed: archive pages an0604-an0702 scraped) …\n",
+    "[INFO] No more jokes at: https://www.anekdot.ru/an/an0703/x070333;1000,100.html\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+    "[INFO] Processed another 101 items, for a total of 48658\n",
+    "[INFO] … (repetitive progress lines trimmed: archive pages an0704-an0802 scraped) …\n",
+    "[INFO] Processed another 101 items, for a total of 60837\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[INFO] Processed another 20 items, for a total of 60857\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0803/x080333;1100,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 60958\n",
+ "[INFO] Processed another 101 items, for a total of 61059\n",
+ "[INFO] Processed another 101 items, for a total of 61160\n",
+ "[INFO] Processed another 101 items, for a total of 61261\n",
+ "[INFO] Processed another 101 items, for a total of 61362\n",
+ "[INFO] Processed another 101 items, for a total of 61463\n",
+ "[INFO] Processed another 101 items, for a total of 61564\n",
+ "[INFO] Processed another 101 items, for a total of 61665\n",
+ "[INFO] Processed another 101 items, for a total of 61766\n",
+ "[INFO] Processed another 97 items, for a total of 61863\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0804/x080433;1000,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 61964\n",
+ "[INFO] Processed another 101 items, for a total of 62065\n",
+ "[INFO] Processed another 101 items, for a total of 62166\n",
+ "[INFO] Processed another 101 items, for a total of 62267\n",
+ "[INFO] Processed another 101 items, for a total of 62368\n",
+ "[INFO] Processed another 101 items, for a total of 62469\n",
+ "[INFO] Processed another 101 items, for a total of 62570\n",
+ "[INFO] Processed another 101 items, for a total of 62671\n",
+ "[INFO] Processed another 101 items, for a total of 62772\n",
+ "[INFO] Processed another 101 items, for a total of 62873\n",
+ "[INFO] Processed another 28 items, for a total of 62901\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0805/x080533;1100,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 63002\n",
+ "[INFO] Processed another 101 items, for a total of 63103\n",
+ "[INFO] Processed another 101 items, for a total of 63204\n",
+ "[INFO] Processed another 101 items, for a total of 63305\n",
+ "[INFO] Processed another 101 items, for a total of 63406\n",
+ "[INFO] Processed another 101 items, for a total of 63507\n",
+ "[INFO] Processed another 101 items, for a total of 63608\n",
+ "[INFO] Processed another 101 items, for a total of 63709\n",
+ "[INFO] Processed another 101 items, for a total of 63810\n",
+ "[INFO] Processed another 101 items, for a total of 63911\n",
+ "[INFO] Processed another 101 items, for a total of 64012\n",
+ "[INFO] Processed another 51 items, for a total of 64063\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0806/x080633;1200,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 64164\n",
+ "[INFO] Processed another 101 items, for a total of 64265\n",
+ "[INFO] Processed another 101 items, for a total of 64366\n",
+ "[INFO] Processed another 101 items, for a total of 64467\n",
+ "[INFO] Processed another 101 items, for a total of 64568\n",
+ "[INFO] Processed another 101 items, for a total of 64669\n",
+ "[INFO] Processed another 101 items, for a total of 64770\n",
+ "[INFO] Processed another 101 items, for a total of 64871\n",
+ "[INFO] Processed another 69 items, for a total of 64940\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0807/x080733;900,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 65041\n",
+ "[INFO] Processed another 101 items, for a total of 65142\n",
+ "[INFO] Processed another 101 items, for a total of 65243\n",
+ "[INFO] Processed another 101 items, for a total of 65344\n",
+ "[INFO] Processed another 101 items, for a total of 65445\n",
+ "[INFO] Processed another 101 items, for a total of 65546\n",
+ "[INFO] Processed another 48 items, for a total of 65594\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0808/x080833;700,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 65695\n",
+ "[INFO] Processed another 101 items, for a total of 65796\n",
+ "[INFO] Processed another 101 items, for a total of 65897\n",
+ "[INFO] Processed another 101 items, for a total of 65998\n",
+ "[INFO] Processed another 101 items, for a total of 66099\n",
+ "[INFO] Processed another 101 items, for a total of 66200\n",
+ "[INFO] Processed another 101 items, for a total of 66301\n",
+ "[INFO] Processed another 101 items, for a total of 66402\n",
+ "[INFO] Processed another 82 items, for a total of 66484\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0809/x080933;900,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 66585\n",
+ "[INFO] Processed another 101 items, for a total of 66686\n",
+ "[INFO] Processed another 101 items, for a total of 66787\n",
+ "[INFO] Processed another 101 items, for a total of 66888\n",
+ "[INFO] Processed another 101 items, for a total of 66989\n",
+ "[INFO] Processed another 101 items, for a total of 67090\n",
+ "[INFO] Processed another 101 items, for a total of 67191\n",
+ "[INFO] Processed another 32 items, for a total of 67223\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0810/x081033;800,100.html\n",
+ "[INFO] Processed another 80 items, for a total of 68212\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0811/x081133;1000,100.html\n",
+ "[INFO] Processed another 48 items, for a total of 69169\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0812/x081233;1000,100.html\n",
+ "[INFO] Processed another 83 items, for a total of 70464\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0901/x090133;1300,100.html\n",
+ "[INFO] Processed another 7 items, for a total of 71279\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0902/x090233;900,100.html\n",
+ "[INFO] Processed another 86 items, for a total of 72072\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0903/x090333;800,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 72779\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[INFO] Processed another 68 items, for a total of 73352\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0904/x090433;1300,100.html\n",
+ "[INFO] Processed another 95 items, for a total of 74457\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0905/x090533;1100,100.html\n",
+ "[INFO] Processed another 6 items, for a total of 75271\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0906/x090633;900,100.html\n",
+ "[INFO] Processed another 98 items, for a total of 75975\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0907/x090733;700,100.html\n",
+ "[INFO] Processed another 39 items, for a total of 76519\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0908/x090833;600,100.html\n",
+ "[INFO] Processed another 39 items, for a total of 77063\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0909/x090933;600,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 77770\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0910/x091033;700,100.html\n",
+ "[INFO] Processed another 36 items, for a total of 78412\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0911/x091133;700,100.html\n",
+ "[INFO] Processed another 41 items, for a total of 79160\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an0912/x091233;800,100.html\n",
+ "[INFO] Processed another 40 items, for a total of 79907\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an1001/x100133;800,100.html\n",
+ "[INFO] Processed another 54 items, for a total of 80870\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an1002/x100233;1000,100.html\n",
+ "[INFO] Processed another 46 items, for a total of 81825\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an1003/x100333;1000,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 82633\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an1004/x100433;800,100.html\n",
+ "[INFO] Processed another 39 items, for a total of 83581\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an1005/x100533;1000,100.html\n",
+ "[INFO] Processed another 34 items, for a total of 84524\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an1006/x100633;1000,100.html\n",
+ "[INFO] Processed another 89 items, for a total of 85320\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an1007/x100733;800,100.html\n",
+ "[INFO] Processed another 75 items, for a total of 86405\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an1008/x100833;1100,100.html\n",
+ "[INFO] Processed another 95 items, for a total of 87409\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an1009/x100933;1000,100.html\n",
+ "[INFO] Processed another 22 items, for a total of 88441\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an1010/x101033;1100,100.html\n",
+ "[INFO] Processed another 99 items, for a total of 89449\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an1011/x101133;1000,100.html\n",
+ "[INFO] Processed another 11 items, for a total of 90672\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an1012/x101233;1300,100.html\n",
+ "[INFO] Processed another 75 items, for a total of 91555\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an1101/x110133;900,100.html\n",
+ "[INFO] Processed another 93 items, for a total of 92658\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an1102/x110233;1100,100.html\n",
+ "[INFO] Processed another 89 items, for a total of 94060\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an1103/x110333;1400,100.html\n",
+ "[INFO] Processed another 47 items, for a total of 95117\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an1104/x110433;1100,100.html\n",
+ "[INFO] Processed another 6 items, for a total of 96133\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an1105/x110533;1100,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 96739\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[INFO] Processed another 74 items, for a total of 97116\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an1106/x110633;1000,100.html\n",
+ "[INFO] Processed another 73 items, for a total of 97997\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an1107/x110733;900,100.html\n",
+ "[INFO] Processed another 37 items, for a total of 98842\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an1108/x110833;900,100.html\n",
+ "[INFO] Processed another 35 items, for a total of 99887\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an1109/x110933;1100,100.html\n",
+ "[INFO] Processed another 24 items, for a total of 100921\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an1110/x111033;1100,100.html\n",
+ "[INFO] Processed another 64 items, for a total of 101995\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an1111/x111133;1100,100.html\n",
+ "[INFO] Processed another 71 items, for a total of 103379\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an1112/x111233;1400,100.html\n",
+ "[INFO] Processed another 51 items, for a total of 104440\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an1201/x120133;1100,100.html\n",
+ "[INFO] Processed another 66 items, for a total of 105718\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an1202/x120233;1300,100.html\n",
+ "[INFO] Processed another 52 items, for a total of 106982\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an1203/x120333;1300,100.html\n",
+ "[INFO] Processed another 70 items, for a total of 107961\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an1204/x120433;1000,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 108870\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[INFO] Processed another 52 items, for a total of 108922\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an1205/x120533;1000,100.html\n",
+ "[INFO] Processed another 33 items, for a total of 109965\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an1206/x120633;1100,100.html\n",
+ "[INFO] Processed another 78 items, for a total of 110952\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an1207/x120733;1000,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 111053\n",
+ "[INFO] Processed another 101 items, for a total of 111154\n",
+ "[INFO] Processed another 101 items, for a total of 111255\n",
+ "[INFO] Processed another 101 items, for a total of 111356\n",
+ "[INFO] Processed another 101 items, for a total of 111457\n",
+ "[INFO] Processed another 101 items, for a total of 111558\n",
+ "[INFO] Processed another 101 items, for a total of 111659\n",
+ "[INFO] Processed another 101 items, for a total of 111760\n",
+ "[INFO] Processed another 101 items, for a total of 111861\n",
+ "[INFO] Processed another 49 items, for a total of 111910\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an1208/x120833;1000,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 112011\n",
+ "[INFO] Processed another 101 items, for a total of 112112\n",
+ "[INFO] Processed another 101 items, for a total of 112213\n",
+ "[INFO] Processed another 101 items, for a total of 112314\n",
+ "[INFO] Processed another 101 items, for a total of 112415\n",
+ "[INFO] Processed another 101 items, for a total of 112516\n",
+ "[INFO] Processed another 101 items, for a total of 112617\n",
+ "[INFO] Processed another 101 items, for a total of 112718\n",
+ "[INFO] Processed another 101 items, for a total of 112819\n",
+ "[INFO] Processed another 70 items, for a total of 112889\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an1209/x120933;1000,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 112990\n",
+ "[INFO] Processed another 101 items, for a total of 113091\n",
+ "[INFO] Processed another 101 items, for a total of 113192\n",
+ "[INFO] Processed another 101 items, for a total of 113293\n",
+ "[INFO] Processed another 101 items, for a total of 113394\n",
+ "[INFO] Processed another 101 items, for a total of 113495\n",
+ "[INFO] Processed another 101 items, for a total of 113596\n",
+ "[INFO] Processed another 101 items, for a total of 113697\n",
+ "[INFO] Processed another 101 items, for a total of 113798\n",
+ "[INFO] Processed another 51 items, for a total of 113849\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an1210/x121033;1000,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 113950\n",
+ "[INFO] Processed another 101 items, for a total of 114051\n",
+ "[INFO] Processed another 101 items, for a total of 114152\n",
+ "[INFO] Processed another 101 items, for a total of 114253\n",
+ "[INFO] Processed another 101 items, for a total of 114354\n",
+ "[INFO] Processed another 101 items, for a total of 114455\n",
+ "[INFO] Processed another 101 items, for a total of 114556\n",
+ "[INFO] Processed another 101 items, for a total of 114657\n",
+ "[INFO] Processed another 93 items, for a total of 114750\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an1211/x121133;900,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 114851\n",
+ "[INFO] Processed another 101 items, for a total of 114952\n",
+ "[INFO] Processed another 101 items, for a total of 115053\n",
+ "[INFO] Processed another 101 items, for a total of 115154\n",
+ "[INFO] Processed another 101 items, for a total of 115255\n",
+ "[INFO] Processed another 101 items, for a total of 115356\n",
+ "[INFO] Processed another 101 items, for a total of 115457\n",
+ "[INFO] Processed another 101 items, for a total of 115558\n",
+ "[INFO] Processed another 101 items, for a total of 115659\n",
+ "[INFO] Processed another 101 items, for a total of 115760\n",
+ "[INFO] Processed another 40 items, for a total of 115800\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an1212/x121233;1100,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 115901\n",
+ "[INFO] Processed another 101 items, for a total of 116002\n",
+ "[INFO] Processed another 101 items, for a total of 116103\n",
+ "[INFO] Processed another 101 items, for a total of 116204\n",
+ "[INFO] Processed another 101 items, for a total of 116305\n",
+ "[INFO] Processed another 101 items, for a total of 116406\n",
+ "[INFO] Processed another 101 items, for a total of 116507\n",
+ "[INFO] Processed another 101 items, for a total of 116608\n",
+ "[INFO] Processed another 101 items, for a total of 116709\n",
+ "[INFO] Processed another 101 items, for a total of 116810\n",
+ "[INFO] Processed another 101 items, for a total of 116911\n",
+ "[INFO] Processed another 101 items, for a total of 117012\n",
+ "[INFO] Processed another 40 items, for a total of 117052\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an1301/x130133;1300,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 117153\n",
+ "[INFO] Processed another 101 items, for a total of 117254\n",
+ "[INFO] Processed another 101 items, for a total of 117355\n",
+ "[INFO] Processed another 101 items, for a total of 117456\n",
+ "[INFO] Processed another 101 items, for a total of 117557\n",
+ "[INFO] Processed another 101 items, for a total of 117658\n",
+ "[INFO] Processed another 101 items, for a total of 117759\n",
+ "[INFO] Processed another 101 items, for a total of 117860\n",
+ "[INFO] Processed another 101 items, for a total of 117961\n",
+ "[INFO] Processed another 101 items, for a total of 118062\n",
+ "[INFO] Processed another 76 items, for a total of 118138\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an1302/x130233;1100,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 118239\n",
+ "[INFO] Processed another 101 items, for a total of 118340\n",
+ "[INFO] Processed another 101 items, for a total of 118441\n",
+ "[INFO] Processed another 101 items, for a total of 118542\n",
+ "[INFO] Processed another 101 items, for a total of 118643\n",
+ "[INFO] Processed another 101 items, for a total of 118744\n",
+ "[INFO] Processed another 101 items, for a total of 118845\n",
+ "[INFO] Processed another 101 items, for a total of 118946\n",
+ "[INFO] Processed another 101 items, for a total of 119047\n",
+ "[INFO] Processed another 101 items, for a total of 119148\n",
+ "[INFO] Processed another 101 items, for a total of 119249\n",
+ "[INFO] Processed another 86 items, for a total of 119335\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an1303/x130333;1200,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 119436\n",
+ "[INFO] Processed another 101 items, for a total of 119537\n",
+ "[INFO] Processed another 101 items, for a total of 119638\n",
+ "[INFO] Processed another 101 items, for a total of 119739\n",
+ "[INFO] Processed another 101 items, for a total of 119840\n",
+ "[INFO] Processed another 101 items, for a total of 119941\n",
+ "[INFO] Processed another 101 items, for a total of 120042\n",
+ "[INFO] Processed another 101 items, for a total of 120143\n",
+ "[INFO] Processed another 101 items, for a total of 120244\n",
+ "[INFO] Processed another 101 items, for a total of 120345\n",
+ "[INFO] Processed another 101 items, for a total of 120446\n",
+ "[INFO] Processed another 34 items, for a total of 120480\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an1304/x130433;1200,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 120581\n",
+ "[INFO] Processed another 101 items, for a total of 120682\n",
+ "[INFO] Processed another 101 items, for a total of 120783\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[INFO] Processed another 101 items, for a total of 120884\n",
+ "... (repetitive scraper progress output omitted) ...\n",
+ "[INFO] Processed another 101 items, for a total of 132614\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[INFO] Processed another 101 items, for a total of 132715\n",
+ "... (repetitive scraper progress output omitted) ...\n",
+ "[INFO] Processed another 101 items, for a total of 144757\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[INFO] Processed another 101 items, for a total of 144858\n",
+ "... (repetitive scraper progress output omitted) ...\n",
+ "[INFO] Processed another 18 items, for a total of 154985\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an1512/x151233;1500,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 155086\n",
+ "[INFO] Processed another 101 items, for a total of 155187\n",
+ "[INFO] Processed another 101 items, for a total of 155288\n",
+ "[INFO] Processed another 101 items, for a total of 155389\n",
+ "[INFO] Processed another 101 items, for a total of 155490\n",
+ "[INFO] Processed another 101 items, for a total of 155591\n",
+ "[INFO] Processed another 101 items, for a total of 155692\n",
+ "[INFO] Processed another 101 items, for a total of 155793\n",
+ "[INFO] Processed another 101 items, for a total of 155894\n",
+ "[INFO] Processed another 101 items, for a total of 155995\n",
+ "[INFO] Processed another 20 items, for a total of 156015\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an1601/x160133;1100,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 156116\n",
+ "[INFO] Processed another 101 items, for a total of 156217\n",
+ "[INFO] Processed another 101 items, for a total of 156318\n",
+ "[INFO] Processed another 101 items, for a total of 156419\n",
+ "[INFO] Processed another 101 items, for a total of 156520\n",
+ "[INFO] Processed another 101 items, for a total of 156621\n",
+ "[INFO] Processed another 101 items, for a total of 156722\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[INFO] Processed another 101 items, for a total of 156823\n",
+ "[INFO] … repetitive scraper progress log truncated (monthly anekdot.ru archives an1602–an1702; totals climb from 156823 to 168334) …\n",
+ "[INFO] Processed another 101 items, for a total of 168435\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[INFO] Processed another 101 items, for a total of 168536\n",
+ "[INFO] … repetitive scraper progress log truncated (monthly anekdot.ru archives an1703–an1802; totals climb from 168536 to 180168) …\n",
+ "[INFO] Processed another 101 items, for a total of 180269\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[INFO] Processed another 101 items, for a total of 180370\n",
+ "[INFO] … repetitive scraper progress log truncated (monthly anekdot.ru archives an1803–an1811; totals climb from 180370 to 192715) …\n",
+ "[INFO] Processed another 101 items, for a total of 192816\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[INFO] Processed another 101 items, for a total of 192917\n",
+ "[INFO] … repetitive scraper progress log truncated (monthly anekdot.ru archives an1812–an1902; totals climb from 192917 to 197307) …\n",
+ "[INFO] Processed another 101 items, for a total of 197408\n",
+ "[INFO] Processed another 101 items, for a total of 197509\n",
+ "[INFO] Processed another 101 items, for a total of 197610\n",
+ "[INFO] Processed another 88 items, for a total of 197698\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an1903/x190333;1500,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 197799\n",
+ "[INFO] Processed another 101 items, for a total of 197900\n",
+ "[INFO] Processed another 101 items, for a total of 198001\n",
+ "[INFO] Processed another 101 items, for a total of 198102\n",
+ "[INFO] Processed another 101 items, for a total of 198203\n",
+ "[INFO] Processed another 101 items, for a total of 198304\n",
+ "[INFO] Processed another 101 items, for a total of 198405\n",
+ "[INFO] Processed another 101 items, for a total of 198506\n",
+ "[INFO] Processed another 101 items, for a total of 198607\n",
+ "[INFO] Processed another 101 items, for a total of 198708\n",
+ "[INFO] Processed another 101 items, for a total of 198809\n",
+ "[INFO] Processed another 101 items, for a total of 198910\n",
+ "[INFO] Processed another 101 items, for a total of 199011\n",
+ "[INFO] Processed another 101 items, for a total of 199112\n",
+ "[INFO] Processed another 101 items, for a total of 199213\n",
+ "[INFO] Processed another 101 items, for a total of 199314\n",
+ "[INFO] Processed another 42 items, for a total of 199356\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an1904/x190433;1700,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 199457\n",
+ "[INFO] Processed another 101 items, for a total of 199558\n",
+ "[INFO] Processed another 101 items, for a total of 199659\n",
+ "[INFO] Processed another 101 items, for a total of 199760\n",
+ "[INFO] Processed another 101 items, for a total of 199861\n",
+ "[INFO] Processed another 101 items, for a total of 199962\n",
+ "[INFO] Processed another 101 items, for a total of 200063\n",
+ "[INFO] Processed another 101 items, for a total of 200164\n",
+ "[INFO] Processed another 101 items, for a total of 200265\n",
+ "[INFO] Processed another 101 items, for a total of 200366\n",
+ "[INFO] Processed another 101 items, for a total of 200467\n",
+ "[INFO] Processed another 101 items, for a total of 200568\n",
+ "[INFO] Processed another 101 items, for a total of 200669\n",
+ "[INFO] Processed another 101 items, for a total of 200770\n",
+ "[INFO] Processed another 21 items, for a total of 200791\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an1905/x190533;1500,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 200892\n",
+ "[INFO] Processed another 101 items, for a total of 200993\n",
+ "[INFO] Processed another 101 items, for a total of 201094\n",
+ "[INFO] Processed another 101 items, for a total of 201195\n",
+ "[INFO] Processed another 101 items, for a total of 201296\n",
+ "[INFO] Processed another 101 items, for a total of 201397\n",
+ "[INFO] Processed another 101 items, for a total of 201498\n",
+ "[INFO] Processed another 101 items, for a total of 201599\n",
+ "[INFO] Processed another 101 items, for a total of 201700\n",
+ "[INFO] Processed another 101 items, for a total of 201801\n",
+ "[INFO] Processed another 101 items, for a total of 201902\n",
+ "[INFO] Processed another 101 items, for a total of 202003\n",
+ "[INFO] Processed another 101 items, for a total of 202104\n",
+ "[INFO] Processed another 101 items, for a total of 202205\n",
+ "[INFO] Processed another 2 items, for a total of 202207\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an1906/x190633;1500,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 202308\n",
+ "[INFO] Processed another 101 items, for a total of 202409\n",
+ "[INFO] Processed another 101 items, for a total of 202510\n",
+ "[INFO] Processed another 101 items, for a total of 202611\n",
+ "[INFO] Processed another 101 items, for a total of 202712\n",
+ "[INFO] Processed another 101 items, for a total of 202813\n",
+ "[INFO] Processed another 101 items, for a total of 202914\n",
+ "[INFO] Processed another 101 items, for a total of 203015\n",
+ "[INFO] Processed another 101 items, for a total of 203116\n",
+ "[INFO] Processed another 101 items, for a total of 203217\n",
+ "[INFO] Processed another 2 items, for a total of 203219\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an1907/x190733;1100,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 203320\n",
+ "[INFO] Processed another 101 items, for a total of 203421\n",
+ "[INFO] Processed another 101 items, for a total of 203522\n",
+ "[INFO] Processed another 101 items, for a total of 203623\n",
+ "[INFO] Processed another 101 items, for a total of 203724\n",
+ "[INFO] Processed another 101 items, for a total of 203825\n",
+ "[INFO] Processed another 101 items, for a total of 203926\n",
+ "[INFO] Processed another 101 items, for a total of 204027\n",
+ "[INFO] Processed another 101 items, for a total of 204128\n",
+ "[INFO] Processed another 101 items, for a total of 204229\n",
+ "[INFO] Processed another 37 items, for a total of 204266\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an1908/x190833;1100,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 204367\n",
+ "[INFO] Processed another 101 items, for a total of 204468\n",
+ "[INFO] Processed another 101 items, for a total of 204569\n",
+ "[INFO] Processed another 101 items, for a total of 204670\n",
+ "[INFO] Processed another 101 items, for a total of 204771\n",
+ "[INFO] Processed another 101 items, for a total of 204872\n",
+    "[INFO] Processed another 101 items, for a total of 204973\n",
+ "[INFO] Processed another 101 items, for a total of 205074\n",
+ "[INFO] Processed another 101 items, for a total of 205175\n",
+ "[INFO] Processed another 101 items, for a total of 205276\n",
+ "[INFO] Processed another 101 items, for a total of 205377\n",
+ "[INFO] Processed another 81 items, for a total of 205458\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an1909/x190933;1200,100.html\n",
+ "[INFO] Processed another 43 items, for a total of 205501\n",
+ "[INFO] Processed another 101 items, for a total of 205602\n",
+ "[INFO] Processed another 101 items, for a total of 205703\n",
+ "[INFO] Processed another 101 items, for a total of 205804\n",
+ "[INFO] Processed another 101 items, for a total of 205905\n",
+ "[INFO] Processed another 101 items, for a total of 206006\n",
+ "[INFO] Processed another 101 items, for a total of 206107\n",
+ "[INFO] Processed another 101 items, for a total of 206208\n",
+ "[INFO] Processed another 101 items, for a total of 206309\n",
+ "[INFO] Processed another 101 items, for a total of 206410\n",
+ "[INFO] Processed another 101 items, for a total of 206511\n",
+ "[INFO] Processed another 101 items, for a total of 206612\n",
+ "[INFO] Processed another 18 items, for a total of 206630\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an1910/x191033;1300,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 206731\n",
+ "[INFO] Processed another 101 items, for a total of 206832\n",
+ "[INFO] Processed another 101 items, for a total of 206933\n",
+ "[INFO] Processed another 101 items, for a total of 207034\n",
+ "[INFO] Processed another 101 items, for a total of 207135\n",
+ "[INFO] Processed another 101 items, for a total of 207236\n",
+ "[INFO] Processed another 101 items, for a total of 207337\n",
+ "[INFO] Processed another 101 items, for a total of 207438\n",
+ "[INFO] Processed another 101 items, for a total of 207539\n",
+ "[INFO] Processed another 101 items, for a total of 207640\n",
+ "[INFO] Processed another 101 items, for a total of 207741\n",
+ "[INFO] Processed another 101 items, for a total of 207842\n",
+ "[INFO] Processed another 101 items, for a total of 207943\n",
+ "[INFO] Processed another 74 items, for a total of 208017\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an1911/x191133;1400,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 208118\n",
+ "[INFO] Processed another 101 items, for a total of 208219\n",
+ "[INFO] Processed another 101 items, for a total of 208320\n",
+ "[INFO] Processed another 101 items, for a total of 208421\n",
+ "[INFO] Processed another 101 items, for a total of 208522\n",
+ "[INFO] Processed another 101 items, for a total of 208623\n",
+ "[INFO] Processed another 101 items, for a total of 208724\n",
+ "[INFO] Processed another 101 items, for a total of 208825\n",
+ "[INFO] Processed another 101 items, for a total of 208926\n",
+ "[INFO] Processed another 101 items, for a total of 209027\n",
+ "[INFO] Processed another 101 items, for a total of 209128\n",
+ "[INFO] Processed another 101 items, for a total of 209229\n",
+ "[INFO] Processed another 101 items, for a total of 209330\n",
+ "[INFO] Processed another 6 items, for a total of 209336\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an1912/x191233;1400,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 209437\n",
+ "[INFO] Processed another 101 items, for a total of 209538\n",
+ "[INFO] Processed another 101 items, for a total of 209639\n",
+ "[INFO] Processed another 101 items, for a total of 209740\n",
+ "[INFO] Processed another 101 items, for a total of 209841\n",
+ "[INFO] Processed another 101 items, for a total of 209942\n",
+ "[INFO] Processed another 101 items, for a total of 210043\n",
+ "[INFO] Processed another 101 items, for a total of 210144\n",
+ "[INFO] Processed another 101 items, for a total of 210245\n",
+ "[INFO] Processed another 101 items, for a total of 210346\n",
+ "[INFO] Processed another 101 items, for a total of 210447\n",
+ "[INFO] Processed another 101 items, for a total of 210548\n",
+ "[INFO] Processed another 101 items, for a total of 210649\n",
+ "[INFO] Processed another 101 items, for a total of 210750\n",
+ "[INFO] Processed another 101 items, for a total of 210851\n",
+ "[INFO] Processed another 101 items, for a total of 210952\n",
+ "[INFO] Processed another 55 items, for a total of 211007\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an2001/x200133;1700,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 211108\n",
+ "[INFO] Processed another 101 items, for a total of 211209\n",
+ "[INFO] Processed another 101 items, for a total of 211310\n",
+ "[INFO] Processed another 101 items, for a total of 211411\n",
+ "[INFO] Processed another 101 items, for a total of 211512\n",
+ "[INFO] Processed another 101 items, for a total of 211613\n",
+ "[INFO] Processed another 101 items, for a total of 211714\n",
+ "[INFO] Processed another 101 items, for a total of 211815\n",
+ "[INFO] Processed another 101 items, for a total of 211916\n",
+ "[INFO] Processed another 101 items, for a total of 212017\n",
+ "[INFO] Processed another 101 items, for a total of 212118\n",
+ "[INFO] Processed another 101 items, for a total of 212219\n",
+ "[INFO] Processed another 101 items, for a total of 212320\n",
+ "[INFO] Processed another 101 items, for a total of 212421\n",
+ "[INFO] Processed another 101 items, for a total of 212522\n",
+ "[INFO] Processed another 101 items, for a total of 212623\n",
+ "[INFO] Processed another 101 items, for a total of 212724\n",
+ "[INFO] Processed another 101 items, for a total of 212825\n",
+ "[INFO] Processed another 14 items, for a total of 212839\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an2002/x200233;1900,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 212940\n",
+ "[INFO] Processed another 101 items, for a total of 213041\n",
+ "[INFO] Processed another 101 items, for a total of 213142\n",
+ "[INFO] Processed another 101 items, for a total of 213243\n",
+ "[INFO] Processed another 101 items, for a total of 213344\n",
+ "[INFO] Processed another 101 items, for a total of 213445\n",
+ "[INFO] Processed another 101 items, for a total of 213546\n",
+ "[INFO] Processed another 101 items, for a total of 213647\n",
+ "[INFO] Processed another 101 items, for a total of 213748\n",
+ "[INFO] Processed another 101 items, for a total of 213849\n",
+ "[INFO] Processed another 101 items, for a total of 213950\n",
+ "[INFO] Processed another 101 items, for a total of 214051\n",
+ "[INFO] Processed another 101 items, for a total of 214152\n",
+ "[INFO] Processed another 101 items, for a total of 214253\n",
+ "[INFO] Processed another 101 items, for a total of 214354\n",
+ "[INFO] Processed another 101 items, for a total of 214455\n",
+ "[INFO] Processed another 101 items, for a total of 214556\n",
+ "[INFO] Processed another 101 items, for a total of 214657\n",
+ "[INFO] Processed another 101 items, for a total of 214758\n",
+ "[INFO] Processed another 101 items, for a total of 214859\n",
+ "[INFO] Processed another 101 items, for a total of 214960\n",
+ "[INFO] Processed another 101 items, for a total of 215061\n",
+ "[INFO] Processed another 101 items, for a total of 215162\n",
+ "[INFO] Processed another 101 items, for a total of 215263\n",
+ "[INFO] Processed another 101 items, for a total of 215364\n",
+ "[INFO] Processed another 25 items, for a total of 215389\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an2003/x200333;2600,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 215490\n",
+ "[INFO] Processed another 101 items, for a total of 215591\n",
+ "[INFO] Processed another 101 items, for a total of 215692\n",
+ "[INFO] Processed another 101 items, for a total of 215793\n",
+ "[INFO] Processed another 101 items, for a total of 215894\n",
+ "[INFO] Processed another 101 items, for a total of 215995\n",
+ "[INFO] Processed another 101 items, for a total of 216096\n",
+ "[INFO] Processed another 101 items, for a total of 216197\n",
+ "[INFO] Processed another 101 items, for a total of 216298\n",
+ "[INFO] Processed another 101 items, for a total of 216399\n",
+ "[INFO] Processed another 101 items, for a total of 216500\n",
+ "[INFO] Processed another 101 items, for a total of 216601\n",
+ "[INFO] Processed another 101 items, for a total of 216702\n",
+ "[INFO] Processed another 101 items, for a total of 216803\n",
+ "[INFO] Processed another 101 items, for a total of 216904\n",
+ "[INFO] Processed another 101 items, for a total of 217005\n",
+ "[INFO] Processed another 101 items, for a total of 217106\n",
+ "[INFO] Processed another 101 items, for a total of 217207\n",
+ "[INFO] Processed another 101 items, for a total of 217308\n",
+ "[INFO] Processed another 101 items, for a total of 217409\n",
+ "[INFO] Processed another 101 items, for a total of 217510\n",
+    "[INFO] Processed another 101 items, for a total of 217611\n",
+ "[INFO] Processed another 101 items, for a total of 217712\n",
+ "[INFO] Processed another 101 items, for a total of 217813\n",
+ "[INFO] Processed another 79 items, for a total of 217892\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an2004/x200433;2500,100.html\n",
+ "[INFO] Processed another 101 items, for a total of 217993\n",
+ "[INFO] Processed another 101 items, for a total of 218094\n",
+ "[INFO] Processed another 101 items, for a total of 218195\n",
+ "[INFO] Processed another 101 items, for a total of 218296\n",
+ "[INFO] Processed another 101 items, for a total of 218397\n",
+ "[INFO] Processed another 101 items, for a total of 218498\n",
+ "[INFO] Processed another 46 items, for a total of 218544\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an2005/x200533;700,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an2006/x200633;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an2007/x200733;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an2008/x200833;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an2009/x200933;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an2010/x201033;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an2011/x201133;0,100.html\n",
+ "[INFO] No more jokes at: https://www.anekdot.ru/an/an2012/x201233;0,100.html\n"
+ ]
+ }
+ ],
+ "source": [
+ "def urls_anekdot(j_type, step, max_num_jokes, pbar):\n",
+    "    # Joke types: j - fresh; s - repeats; x - others\n",
+    "    # Story types: o - fresh\n",
+ " URL = f'https://www.anekdot.ru/an/an{{}}{{}}/{j_type}{{}}{{}}33;{{}},100.html' # year;month;year;month;start_from\n",
+ " # All years from 1995 to 2020\n",
+ " years = list(map(lambda x: f'{x:0>2}', list(range(95, 100)) + list(range(0, 21))))\n",
+ " months = list(map(lambda x: f'{x:0>2}', list(range(1, 13))))\n",
+ " for year in years:\n",
+ " for month in months:\n",
+ " pbar.set_description(f'Processing {year} {month} page...')\n",
+ " for start_ind in range(0, max_num_jokes, step):\n",
+ " yield (URL.format(year, month, year, month, start_ind),\n",
+    "                       ((max_num_jokes - start_ind) // step) - 1)\n",
"\n",
- "n_batches = 40\n",
- "skip_downloaded = False\n",
- "n_processed = 0\n",
- "pbar = tqdm(total=n_batches)\n",
- "for i in range(n_batches):\n",
- " pbar.set_description('Loading {} batch...'.format(i+1))\n",
- "# time.sleep(1)\n",
- " page = requests.get(URL.format(i))\n",
- " pbar.set_description('Processing {} batch...'.format(i+1))\n",
+    "max_num, step = 4000, 100\n",
+    "anekdot_jokes, n_processed = [], 0\n",
+    "# 26 years (1995-2020) x 12 months; each page advances the bar by `step`\n",
+    "pbar = tqdm(total=max_num * 26 * 12)\n",
+    "iterator = urls_anekdot('x', step, max_num, pbar)\n",
+ "for page_url, n_skip in iterator:\n",
+ " page = requests.get(page_url, headers=headers)\n",
" soup = BeautifulSoup(page.content, 'html.parser')\n",
- " blocks = soup.body.findAll('div', 'fusion-post-content post-content')\n",
- " for j, block in enumerate(blocks):\n",
- " pbar.set_description('Processing {} block...'.format(j+1))\n",
- " block_title = block.find('h2', 'entry-title fusion-post-title').a\n",
- " transcript_url = block_title['href']\n",
- " file_name = transcript_url[:-1].rsplit('/', 1)[-1]\n",
- " file_path = os.path.join(output_path, file_name + '.txt')\n",
- " # Skip `bad` pages\n",
- " if transcript_url in pages_to_skip:\n",
- " if os.path.exists(file_path):\n",
- " os.remove(file_path)\n",
- " continue\n",
- " if not ('transcript' in block_title.contents[0].lower() or transcript_url in ok_pages):\n",
- " print('[WARN] Possibly page without transcript!', transcript_url)\n",
- " if not (os.path.exists(file_path) and skip_downloaded):\n",
- " try:\n",
- " scrap_transcript(transcript_url, file_path)\n",
- " except Exception:\n",
- " print('[ERROR] Some error on:', transcript_url)\n",
- " n_processed += len(blocks)\n",
- " print('[INFO] Processed another ', len(blocks), 'blocks, for a total of', n_processed)\n",
- " pbar.update(1)"
+ " new_items = [' '.join(process_block(joke)) for joke in soup.findAll('div', 'text')]\n",
+ " if len(new_items) == 0:\n",
+ " print('[INFO] No more jokes at:', page_url)\n",
+ " for _ in range(n_skip):\n",
+ " next(iterator)\n",
+    "        pbar.update(step * (n_skip + 1))  # count the skipped pages too\n",
+ " continue\n",
+ " anekdot_jokes.extend(new_items)\n",
+ " n_processed += len(new_items)\n",
+    "    print('[INFO] Processed another', len(new_items), 'items, for a total of', n_processed)\n",
+    "    pbar.update(step)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 45,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "218544"
+ ]
+ },
+ "execution_count": 45,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "len(anekdot_jokes)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 46,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# pd.DataFrame(anekdot_jokes, columns=['Text']).to_csv('../data/anekdot_others.csv')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Telegram channel\n",
+ "https://t.me/ligaplohihshutok"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 47,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import json\n",
+ "with open('../data/liga-plohih-shutok.json', encoding='utf-8') as in_file:\n",
+ " liga_jokes = json.loads(in_file.read())"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 48,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "pd.DataFrame(liga_jokes, columns=['Text']).to_csv('../data/ru_lpsh_jokes.csv')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Extract QA jokes\n",
+    "We can extract question-answer (QA) jokes from the general joke datasets."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 61,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import nltk\n",
+ "import traceback\n",
+ "\n",
+    "regexps = [  # Normalise special characters and whitespace\n",
+ " (re.compile('♦'), '*'),\n",
+ " (re.compile('\\n *\\n'), '\\n'), # Replace multiple newlines with one\n",
+ " (re.compile(r' {2,}'), ' '), # Replace multiple spaces with one\n",
+ "]\n",
+ "\n",
+ "def fix_text(s):\n",
+ " for regexp in regexps:\n",
+ " s = regexp[0].sub(regexp[1], s)\n",
+ " s = s.strip(' -—')\n",
+ " s = re.sub('^(?:вопрос|ответ):?', '', s, flags=re.IGNORECASE)\n",
+ " return s.strip()\n",
+ "\n",
+ "\n",
+ "def extract_qa_jokes(iterator, max_num_sents=2):\n",
+ " res = []\n",
+ " pbar = tqdm(total=len(iterator))\n",
+    "    for i, joke in enumerate(iterator):\n",
+    "        if not isinstance(joke, str):  # NaN rows read from CSV arrive as float\n",
+    "            joke = ''\n",
+ " try:\n",
+ " joke = re.sub(r'^[^\\n\\:\\?\\.]*(?:Армянское|Армянскому?|Армянского) *радио[^\\n:?]*\\:', '', joke, flags=re.IGNORECASE)\n",
+ " sentences = [fix_text(s) for s in nltk.sent_tokenize(joke, language=\"russian\")]\n",
+ " sentences = [s for s in sentences if s]\n",
+ " if sentences and sentences[0][-1] == '?' and 1 < len(sentences) <= max_num_sents:\n",
+ " res.append({\n",
+ " 'Question': sentences[0],\n",
+ " 'Answer': ' '.join(sentences[1:])\n",
+ " })\n",
+    "        except Exception:\n",
+ " print(f'Error at {i}')\n",
+ " traceback.print_exc()\n",
+ " if i % 500 == 0:\n",
+ " pbar.set_description(f'Extracted: {len(res)} jokes')\n",
+ " pbar.update(1)\n",
+ " pbar.set_description(f'Extracted: {len(res)} jokes')\n",
+ " pbar.close()\n",
+ " return res"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 62,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "c2e9eff8880041ac849c0a346c42cf4e",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "HBox(children=(FloatProgress(value=0.0, max=20025.0), HTML(value='')))"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n"
+ ]
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "0559cb4a15674b7485ee7c0b14bbe7cb",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "HBox(children=(FloatProgress(value=0.0, max=200476.0), HTML(value='')))"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n"
+ ]
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "d8fb891271da4190bee04addb25e8c51",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "HBox(children=(FloatProgress(value=0.0, max=216603.0), HTML(value='')))"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Error at 49407\n",
+ "Error at 64195\n",
+ "Error at 69074\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Traceback (most recent call last):\n",
+ " File \"\", line 23, in extract_qa_jokes\n",
+ " joke = re.sub(r'^[^\\n\\:\\?\\.]*(?:Армянское|Армянскому?|Армянского) *радио[^\\n:?]*\\:', '', joke, flags=re.IGNORECASE)\n",
+ " File \"C:\\Users\\Alex\\Anaconda3\\envs\\pytorch\\lib\\re.py\", line 192, in sub\n",
+ " return _compile(pattern, flags).sub(repl, string, count)\n",
+    "TypeError: expected string or bytes-like object\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Error at 132540\n",
+ "Error at 134146\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Traceback (most recent call last):\n",
+ " File \"\", line 23, in extract_qa_jokes\n",
+ " joke = re.sub(r'^[^\\n\\:\\?\\.]*(?:Армянское|Армянскому?|Армянского) *радио[^\\n:?]*\\:', '', joke, flags=re.IGNORECASE)\n",
+ " File \"C:\\Users\\Alex\\Anaconda3\\envs\\pytorch\\lib\\re.py\", line 192, in sub\n",
+ " return _compile(pattern, flags).sub(repl, string, count)\n",
+    "TypeError: expected string or bytes-like object\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n"
+ ]
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "2c1cc2d3e29f4cdb99eba3da3a0fd0e8",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "HBox(children=(FloatProgress(value=0.0, max=218544.0), HTML(value='')))"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n"
+ ]
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "7e65de2cc55143ec9cc09615e1df4e88",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "HBox(children=(FloatProgress(value=0.0, max=1806.0), HTML(value='')))"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ "67563"
+ ]
+ },
+ "execution_count": 62,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "files = [\n",
+ " '../data/anecdotika.csv',\n",
+ " '../data/anekdot_fresh.csv',\n",
+ " '../data/anekdot_repetative.csv',\n",
+ " '../data/anekdot_others.csv',\n",
+ " '../data/ru_lpsh_jokes.csv',\n",
+ "]\n",
+ "\n",
+ "qa_jokes = []\n",
+ "\n",
+ "for file in files:\n",
+ " jokes = pd.read_csv(file)\n",
+ " qa_anekdot = extract_qa_jokes(jokes['Text'].values, max_num_sents=3)\n",
+ " qa_jokes.extend(qa_anekdot)\n",
+ "len(qa_jokes)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 63,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# pd.DataFrame.from_dict(qa_jokes).to_csv('../data/rus_qa_jokes.csv')"
]
}
],
diff --git a/train/run_generation.py b/train/run_generation.py
index 3f90ee5..4fd429a 100644
--- a/train/run_generation.py
+++ b/train/run_generation.py
@@ -14,52 +14,47 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
-""" Conditional text generation with the auto-regressive models of the library (GPT/GPT-2/CTRL/Transformer-XL/XLNet)
+""" Conditional text generation with the auto-regressive models of the library (GPT/GPT-2/Transformer-XL/XLNet)
"""
-
+from __future__ import absolute_import, division, print_function, unicode_literals
import argparse
import logging
+from tqdm import trange
-import numpy as np
import torch
+import torch.nn.functional as F
+import numpy as np
+
+from transformers import (GPT2Config, OpenAIGPTConfig, XLNetConfig, TransfoXLConfig,
+ GPT2LMHeadModel, GPT2Tokenizer,
+ OpenAIGPTLMHeadModel, OpenAIGPTTokenizer,
+ XLNetLMHeadModel, XLNetTokenizer,
+ TransfoXLLMHeadModel, TransfoXLTokenizer, )
+from yt_encoder import YTEncoder
+
-from transformers import (
- CTRLLMHeadModel,
- CTRLTokenizer,
- GPT2LMHeadModel,
- GPT2Tokenizer,
- OpenAIGPTLMHeadModel,
- OpenAIGPTTokenizer,
- TransfoXLLMHeadModel,
- TransfoXLTokenizer,
- XLMTokenizer,
- XLMWithLMHeadModel,
- XLNetLMHeadModel,
- XLNetTokenizer,
-)
-
-
-logging.basicConfig(
- format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", datefmt="%m/%d/%Y %H:%M:%S", level=logging.INFO,
-)
+logging.basicConfig(format='%(asctime)s - %(levelname)s - %(name)s - %(message)s',
+                    datefmt='%m/%d/%Y %H:%M:%S',
+                    level=logging.INFO)
logger = logging.getLogger(__name__)
MAX_LENGTH = int(10000) # Hardcoded max length to avoid infinite loop
+ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) for conf in (GPT2Config, OpenAIGPTConfig, XLNetConfig, TransfoXLConfig)), ())
+
MODEL_CLASSES = {
- "gpt2": (GPT2LMHeadModel, GPT2Tokenizer),
- "ctrl": (CTRLLMHeadModel, CTRLTokenizer),
- "openai-gpt": (OpenAIGPTLMHeadModel, OpenAIGPTTokenizer),
- "xlnet": (XLNetLMHeadModel, XLNetTokenizer),
- "transfo-xl": (TransfoXLLMHeadModel, TransfoXLTokenizer),
- "xlm": (XLMWithLMHeadModel, XLMTokenizer),
+ 'gpt2': (GPT2LMHeadModel, GPT2Tokenizer),
+ 'gpt2-yttm': (GPT2LMHeadModel, YTEncoder),
+ 'openai-gpt': (OpenAIGPTLMHeadModel, OpenAIGPTTokenizer),
+ 'xlnet': (XLNetLMHeadModel, XLNetTokenizer),
+ 'transfo-xl': (TransfoXLLMHeadModel, TransfoXLTokenizer),
}
# Padding text to help Transformer-XL and XLNet with short prompts as proposed by Aman Rusia
# in https://github.com/rusiaaman/XLNet-gen#methodology
# and https://medium.com/@amanrusia/xlnet-speaks-comparison-to-gpt-2-ea1a4e9ba39e
-PADDING_TEXT = """In 1991, the remains of Russian Tsar Nicholas II and his family
+PADDING_TEXT = """ In 1991, the remains of Russian Tsar Nicholas II and his family
(except for Alexei and Maria) are discovered.
The voice of Nicholas's young son, Tsarevich Alexei Nikolaevich, narrates the
remainder of the story. 1883 Western Siberia,
@@ -70,6 +65,7 @@
the Virgin Mary, prompting him to become a priest. Rasputin quickly becomes famous,
with people, even a bishop, begging for his blessing. """
+FILTER_VALUE = -float('Inf')
def set_seed(args):
np.random.seed(args.seed)
@@ -78,186 +74,159 @@ def set_seed(args):
torch.cuda.manual_seed_all(args.seed)
-#
-# Functions to prepare models' input
-#
-
-
-def prepare_ctrl_input(args, _, tokenizer, prompt_text):
- if args.temperature > 0.7:
- logger.info("CTRL typically works better with lower temperatures (and lower top_k).")
-
- encoded_prompt = tokenizer.encode(prompt_text, add_special_tokens=False)
- if not any(encoded_prompt[0] == x for x in tokenizer.control_codes.values()):
- logger.info("WARNING! You are not starting your generation from a control code so you won't get good results")
- return prompt_text
-
-
-def prepare_xlm_input(args, model, tokenizer, prompt_text):
- # kwargs = {"language": None, "mask_token_id": None}
-
- # Set the language
- use_lang_emb = hasattr(model.config, "use_lang_emb") and model.config.use_lang_emb
- if hasattr(model.config, "lang2id") and use_lang_emb:
- available_languages = model.config.lang2id.keys()
- if args.xlm_language in available_languages:
- language = args.xlm_language
- else:
- language = None
- while language not in available_languages:
- language = input("Using XLM. Select language in " + str(list(available_languages)) + " >>> ")
-
- model.config.lang_id = model.config.lang2id[language]
- # kwargs["language"] = tokenizer.lang2id[language]
-
- # TODO fix mask_token_id setup when configurations will be synchronized between models and tokenizers
- # XLM masked-language modeling (MLM) models need masked token
- # is_xlm_mlm = "mlm" in args.model_name_or_path
- # if is_xlm_mlm:
- # kwargs["mask_token_id"] = tokenizer.mask_token_id
-
- return prompt_text
-
-
-def prepare_xlnet_input(args, _, tokenizer, prompt_text):
- prompt_text = (args.padding_text if args.padding_text else PADDING_TEXT) + prompt_text
- return prompt_text
-
-
-def prepare_transfoxl_input(args, _, tokenizer, prompt_text):
- prompt_text = (args.padding_text if args.padding_text else PADDING_TEXT) + prompt_text
- return prompt_text
-
-
-PREPROCESSING_FUNCTIONS = {
- "ctrl": prepare_ctrl_input,
- "xlm": prepare_xlm_input,
- "xlnet": prepare_xlnet_input,
- "transfo-xl": prepare_transfoxl_input,
-}
-
-
-def adjust_length_to_model(length, max_sequence_length):
- if length < 0 and max_sequence_length > 0:
- length = max_sequence_length
- elif 0 < max_sequence_length < length:
- length = max_sequence_length # No generation bigger than model size
- elif length < 0:
- length = MAX_LENGTH # avoid infinite loop
- return length
+def top_k_top_p_filtering(logits, top_k=0, top_p=0.0, filter_value=-float('Inf')):
+ """ Filter a distribution of logits using top-k and/or nucleus (top-p) filtering
+ Args:
+ logits: logits distribution shape (vocabulary size)
+ top_k > 0: keep only top k tokens with highest probability (top-k filtering).
+ top_p > 0.0: keep the top tokens with cumulative probability >= top_p (nucleus filtering).
+ Nucleus filtering is described in Holtzman et al. (http://arxiv.org/abs/1904.09751)
+ From: https://gist.github.com/thomwolf/1a5a29f6962089e871b94cbd09daf317
+ """
+ assert logits.dim() == 1 # batch size 1 for now - could be updated for more but the code would be less clear
+ top_k = min(top_k, logits.size(-1)) # Safety check
+ if top_k > 0:
+ # Remove all tokens with a probability less than the last token of the top-k
+ indices_to_remove = logits < torch.topk(logits, top_k)[0][..., -1, None]
+ logits[indices_to_remove] = filter_value
+
+ if top_p > 0.0:
+ sorted_logits, sorted_indices = torch.sort(logits, descending=True)
+ cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
+
+ # Remove tokens with cumulative probability above the threshold
+ sorted_indices_to_remove = cumulative_probs > top_p
+ # Shift the indices to the right to keep also the first token above the threshold
+ sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
+ sorted_indices_to_remove[..., 0] = 0
+
+ indices_to_remove = sorted_indices[sorted_indices_to_remove]
+ logits[indices_to_remove] = filter_value
+ return logits
+
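The right-shift in the nucleus step above (so the first token that crosses `top_p` is still kept) is the part that is easiest to get wrong. A pure-Python sketch of the same logic on plain lists, checkable without torch:

```python
import math

FILTER = float("-inf")

def top_p_filter(logits, top_p):
    # Pure-Python sketch of the nucleus (top-p) branch of top_k_top_p_filtering.
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    exps = [math.exp(logits[i]) for i in order]
    total = sum(exps)
    out = list(logits)
    cum = 0.0
    keep_next = True  # the right-shift: the highest-probability token is always kept
    for rank, idx in enumerate(order):
        cum += exps[rank] / total
        if not keep_next:
            out[idx] = FILTER  # cumulative mass already exceeded top_p before this token
        keep_next = cum <= top_p
    return out

filtered = top_p_filter([2.0, 1.0, 0.0], top_p=0.7)
print(filtered)  # the two most likely tokens survive; the tail is masked
```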
+def sample_sequence(model, length, context, num_samples=1, temperature=1, top_k=0, top_p=0.0,
+ is_xlnet=False, device='cpu', max_input=1023, filter_single=[], filter_double=[]):
+ context = torch.tensor(context, dtype=torch.long, device=device)
+ context = context.unsqueeze(0).repeat(num_samples, 1)
+ generated = context
+ with torch.no_grad():
+ for _ in trange(length):
+
+ inputs = {'input_ids': generated[:,-max_input:]}
+ if is_xlnet:
+ # XLNet is a direct (predict same token, not next token) and bi-directional model by default
+ # => need one additional dummy token in the input (will be masked), attention mask and target mapping (see model docstring)
+ input_ids = torch.cat((generated, torch.zeros((1, 1), dtype=torch.long, device=device)), dim=1)
+ perm_mask = torch.zeros((1, input_ids.shape[1], input_ids.shape[1]), dtype=torch.float, device=device)
+ perm_mask[:, :, -1] = 1.0 # Previous tokens don't see last token
+ target_mapping = torch.zeros((1, 1, input_ids.shape[1]), dtype=torch.float, device=device)
+ target_mapping[0, 0, -1] = 1.0 # predict last token
+ inputs = {'input_ids': input_ids, 'perm_mask': perm_mask, 'target_mapping': target_mapping}
+
+ outputs = model(**inputs) # Note: we could also use 'past' with GPT-2/Transfo-XL/XLNet (cached hidden-states)
+ next_tokens = torch.zeros(num_samples, dtype=torch.long).to(device)
+ for isample in range(num_samples):
+ next_token_logits = outputs[0][isample, -1, :] / temperature
+
+ next_token_logits[filter_single] = FILTER_VALUE
+ # filter a blank line (a double \n): don't repeat a token from filter_double
+ if generated[isample, -1] in filter_double:
+ next_token_logits[generated[isample, -1]] = FILTER_VALUE
+
+ filtered_logits = top_k_top_p_filtering(next_token_logits, top_k=top_k, top_p=top_p)
+ next_tokens[isample] = torch.multinomial(F.softmax(filtered_logits, dim=-1), num_samples=1)
+
+ generated = torch.cat((generated, next_tokens.unsqueeze(-1)), dim=1)
+ return generated
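The loop in `sample_sequence` can be reduced to a toy, torch-free form: feed the context to the model, divide logits by the temperature, softmax, draw one token, append, repeat. Here `next_logits_fn` is a made-up stand-in for the language model, not part of the script:

```python
import math
import random

def toy_sample_sequence(next_logits_fn, context, length, temperature=1.0, seed=42):
    # Toy version of the autoregressive sampling loop above;
    # next_logits_fn maps a token list to a list of logits.
    rng = random.Random(seed)
    generated = list(context)
    for _ in range(length):
        logits = [l / temperature for l in next_logits_fn(generated)]
        m = max(logits)
        probs = [math.exp(l - m) for l in logits]     # stable softmax
        total = sum(probs)
        probs = [p / total for p in probs]
        r, acc, tok = rng.random(), 0.0, 0            # multinomial draw
        for tok, p in enumerate(probs):
            acc += p
            if r < acc:
                break
        generated.append(tok)
    return generated

# A stub "model" over a 4-token vocabulary that strongly prefers token (last + 1) % 4.
out = toy_sample_sequence(lambda ctx: [10.0 if t == (ctx[-1] + 1) % 4 else 0.0 for t in range(4)],
                          context=[0], length=3)
print(out)
```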
def main():
parser = argparse.ArgumentParser()
- parser.add_argument(
- "--model_type",
- default=None,
- type=str,
- required=True,
- help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()),
- )
- parser.add_argument(
- "--model_name_or_path",
- default=None,
- type=str,
- required=True,
- help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(MODEL_CLASSES.keys()),
- )
-
+ parser.add_argument("--model_type", default=None, type=str, required=True,
+ help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()))
+ parser.add_argument("--model_name_or_path", default=None, type=str, required=True,
+ help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(ALL_MODELS))
parser.add_argument("--prompt", type=str, default="")
+ parser.add_argument("--padding_text", type=str, default="")
parser.add_argument("--length", type=int, default=20)
parser.add_argument("--stop_token", type=str, default=None, help="Token at which text generation is stopped")
-
- parser.add_argument(
- "--temperature",
- type=float,
- default=1.0,
- help="temperature of 1.0 has no effect, lower tend toward greedy sampling",
- )
- parser.add_argument(
- "--repetition_penalty", type=float, default=1.0, help="primarily useful for CTRL model; in that case, use 1.2"
- )
- parser.add_argument("--k", type=int, default=0)
- parser.add_argument("--p", type=float, default=0.9)
-
- parser.add_argument("--padding_text", type=str, default="", help="Padding text for Transfo-XL and XLNet.")
- parser.add_argument("--xlm_language", type=str, default="", help="Optional language when used with the XLM model.")
-
- parser.add_argument("--seed", type=int, default=42, help="random seed for initialization")
- parser.add_argument("--no_cuda", action="store_true", help="Avoid using CUDA when available")
+ parser.add_argument("--temperature", type=float, default=1.0)
+ parser.add_argument("--top_k", type=int, default=0)
+ parser.add_argument("--top_p", type=float, default=0.9)
+ parser.add_argument("--no_cuda", action='store_true',
+ help="Avoid using CUDA when available")
+ parser.add_argument('--seed', type=int, default=42,
+ help="random seed for initialization")
parser.add_argument("--num_return_sequences", type=int, default=1, help="The number of samples to generate.")
args = parser.parse_args()
args.device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
- args.n_gpu = 0 if args.no_cuda else torch.cuda.device_count()
+ args.n_gpu = torch.cuda.device_count()
set_seed(args)
- # Initialize the model and tokenizer
- try:
- args.model_type = args.model_type.lower()
- model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
- except KeyError:
- raise KeyError("the model {} you specified is not supported. You are welcome to add it and open a PR :)")
-
+ args.model_type = args.model_type.lower()
+ model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path)
model = model_class.from_pretrained(args.model_name_or_path)
model.to(args.device)
-
- args.length = adjust_length_to_model(args.length, max_sequence_length=model.config.max_position_embeddings)
- logger.info(args)
-
- prompt_text = args.prompt if args.prompt else input("Model prompt >>> ")
-
- # Different models need different input formatting and/or extra arguments
- requires_preprocessing = args.model_type in PREPROCESSING_FUNCTIONS.keys()
- if requires_preprocessing:
- prepare_input = PREPROCESSING_FUNCTIONS.get(args.model_type)
- preprocessed_prompt_text = prepare_input(args, model, tokenizer, prompt_text)
- encoded_prompt = tokenizer.encode(
- preprocessed_prompt_text, add_special_tokens=False, return_tensors="pt", add_space_before_punct_symbol=True
- )
- else:
- encoded_prompt = tokenizer.encode(prompt_text, add_special_tokens=False, return_tensors="pt")
- encoded_prompt = encoded_prompt.to(args.device)
-
- output_sequences = model.generate(
- input_ids=encoded_prompt,
- max_length=args.length + len(encoded_prompt[0]),
- temperature=args.temperature,
- top_k=args.k,
- top_p=args.p,
- repetition_penalty=args.repetition_penalty,
- do_sample=True,
- num_return_sequences=args.num_return_sequences,
- )
-
- # Remove the batch dimension when returning multiple sequences
- if len(output_sequences.shape) > 2:
- output_sequences.squeeze_()
-
- generated_sequences = []
-
- for generated_sequence_idx, generated_sequence in enumerate(output_sequences):
- print("=== GENERATED SEQUENCE {} ===".format(generated_sequence_idx + 1))
- generated_sequence = generated_sequence.tolist()
-
- # Decode text
- text = tokenizer.decode(generated_sequence, clean_up_tokenization_spaces=True)
-
- # Remove all text after the stop token
- text = text[: text.find(args.stop_token) if args.stop_token else None]
-
- # Add the prompt at the beginning of the sequence. Remove the excess text that was used for pre-processing
- total_sequence = (
- prompt_text + text[len(tokenizer.decode(encoded_prompt[0], clean_up_tokenization_spaces=True)) :]
+ model.eval()
+
+ if args.length < 0 and model.config.max_position_embeddings > 0:
+ args.length = model.config.max_position_embeddings
+ elif 0 < model.config.max_position_embeddings < args.length:
+ args.length = model.config.max_position_embeddings # No generation bigger than model size
+ elif args.length < 0:
+ args.length = MAX_LENGTH # avoid infinite loop
+
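The three length branches above reproduce the deleted `adjust_length_to_model` helper inline; factored back into a function they can be unit-checked:

```python
MAX_LENGTH = 10000  # same hard cap as the script, to avoid an infinite loop

def clamp_length(length, max_sequence_length):
    # Mirrors the branches above: a negative length means "use the model's
    # maximum"; otherwise never generate past the model's position limit.
    if length < 0 and max_sequence_length > 0:
        return max_sequence_length
    if 0 < max_sequence_length < length:
        return max_sequence_length
    if length < 0:
        return MAX_LENGTH  # model reported no limit at all
    return length

print(clamp_length(-1, 1024), clamp_length(5000, 1024), clamp_length(-1, 0))
```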
+ print(args)
+ while True:
+ raw_text = args.prompt if args.prompt else input("Model prompt >>> ")
+ encoded_prompt = tokenizer.encode(raw_text)
+ if args.model_type in ["transfo-xl", "xlnet"]:
+ # Models with memory like to have a long prompt for short inputs.
+ raw_text = (args.padding_text if args.padding_text else PADDING_TEXT) + raw_text
+ context_tokens = tokenizer.encode(raw_text)
+ out = sample_sequence(
+ model=model,
+ context=context_tokens,
+ length=args.length,
+ temperature=args.temperature,
+ top_k=args.top_k,
+ top_p=args.top_p,
+ device=args.device,
+ is_xlnet=bool(args.model_type == "xlnet"),
+ num_samples=args.num_return_sequences,
)
-
- generated_sequences.append(total_sequence)
- print(total_sequence)
-
- return generated_sequences
-
-
-if __name__ == "__main__":
+ # Remove the batch dimension when returning multiple sequences
+ if len(out.shape) > 2:
+ out.squeeze_()
+
+ generated_sequences = []
+ for generated_sequence_idx, generated_sequence in enumerate(out):
+ print("=== GENERATED SEQUENCE {} ===".format(generated_sequence_idx + 1))
+ generated_sequence = generated_sequence.tolist()
+
+ # Decode text
+ text = tokenizer.decode(generated_sequence, clean_up_tokenization_spaces=True)
+
+ # Remove all text after the stop token
+ text = text[: text.find(args.stop_token) if args.stop_token else None]
+
+ # Add the prompt at the beginning of the sequence. Remove the excess text that was used for pre-processing
+ total_sequence = (
+ raw_text + text[len(tokenizer.decode(encoded_prompt[0], clean_up_tokenization_spaces=True)) :]
+ )
+
+ generated_sequences.append(total_sequence)
+ print(total_sequence)
+ # text = tokenizer.decode(out) # , clean_up_tokenization_spaces=True
+ # print(text)
+ if args.prompt:
+ break
+ return generated_sequences
+
+
+if __name__ == '__main__':
main()
diff --git a/train/run_language_modeling.py b/train/run_language_modeling.py
deleted file mode 100644
index 5d451e7..0000000
--- a/train/run_language_modeling.py
+++ /dev/null
@@ -1,787 +0,0 @@
-# coding=utf-8
-# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
-# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""
-Fine-tuning the library models for language modeling on a text file (GPT, GPT-2, BERT, RoBERTa).
-GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa are fine-tuned
-using a masked language modeling (MLM) loss.
-"""
-
-
-import argparse
-import glob
-import logging
-import os
-import pickle
-import random
-import re
-import shutil
-from typing import Dict, List, Tuple
-
-import numpy as np
-import torch
-from torch.nn.utils.rnn import pad_sequence
-from torch.utils.data import DataLoader, Dataset, RandomSampler, SequentialSampler
-from torch.utils.data.distributed import DistributedSampler
-from tqdm import tqdm, trange
-
-from transformers import (
- MODEL_WITH_LM_HEAD_MAPPING,
- WEIGHTS_NAME,
- AdamW,
- AutoConfig,
- AutoModelWithLMHead,
- AutoTokenizer,
- PreTrainedModel,
- PreTrainedTokenizer,
- get_linear_schedule_with_warmup,
-)
-
-
-try:
- from torch.utils.tensorboard import SummaryWriter
-except ImportError:
- from tensorboardX import SummaryWriter
-
-
-logger = logging.getLogger(__name__)
-
-
-MODEL_CONFIG_CLASSES = list(MODEL_WITH_LM_HEAD_MAPPING.keys())
-MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES)
-
-
-class TextDataset(Dataset):
- def __init__(self, tokenizer: PreTrainedTokenizer, args, file_path: str, block_size=512):
- assert os.path.isfile(file_path)
-
- block_size = block_size - (tokenizer.max_len - tokenizer.max_len_single_sentence)
-
- directory, filename = os.path.split(file_path)
- cached_features_file = os.path.join(
- directory, args.model_type + "_cached_lm_" + str(block_size) + "_" + filename
- )
-
- if os.path.exists(cached_features_file) and not args.overwrite_cache:
- logger.info("Loading features from cached file %s", cached_features_file)
- with open(cached_features_file, "rb") as handle:
- self.examples = pickle.load(handle)
- else:
- logger.info("Creating features from dataset file at %s", directory)
-
- self.examples = []
- with open(file_path, encoding="utf-8") as f:
- text = f.read()
-
- tokenized_text = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))
-
- for i in range(0, len(tokenized_text) - block_size + 1, block_size): # Truncate in block of block_size
- self.examples.append(tokenizer.build_inputs_with_special_tokens(tokenized_text[i : i + block_size]))
- # Note that we are loosing the last truncated example here for the sake of simplicity (no padding)
- # If your dataset is small, first you should loook for a bigger one :-) and second you
- # can change this behavior by adding (model specific) padding.
-
- logger.info("Saving features into cached file %s", cached_features_file)
- with open(cached_features_file, "wb") as handle:
- pickle.dump(self.examples, handle, protocol=pickle.HIGHEST_PROTOCOL)
-
- def __len__(self):
- return len(self.examples)
-
- def __getitem__(self, item):
- return torch.tensor(self.examples[item], dtype=torch.long)
-
-
-class LineByLineTextDataset(Dataset):
- def __init__(self, tokenizer: PreTrainedTokenizer, args, file_path: str, block_size=512):
- assert os.path.isfile(file_path)
- # Here, we do not cache the features, operating under the assumption
- # that we will soon use fast multithreaded tokenizers from the
- # `tokenizers` repo everywhere =)
- logger.info("Creating features from dataset file at %s", file_path)
-
- with open(file_path, encoding="utf-8") as f:
- lines = [line for line in f.read().splitlines() if (len(line) > 0 and not line.isspace())]
-
- self.examples = tokenizer.batch_encode_plus(lines, add_special_tokens=True, max_length=block_size)["input_ids"]
-
- def __len__(self):
- return len(self.examples)
-
- def __getitem__(self, i):
- return torch.tensor(self.examples[i], dtype=torch.long)
-
-
-def load_and_cache_examples(args, tokenizer, evaluate=False):
- file_path = args.eval_data_file if evaluate else args.train_data_file
- if args.line_by_line:
- return LineByLineTextDataset(tokenizer, args, file_path=file_path, block_size=args.block_size)
- else:
- return TextDataset(tokenizer, args, file_path=file_path, block_size=args.block_size)
-
-
-def set_seed(args):
- random.seed(args.seed)
- np.random.seed(args.seed)
- torch.manual_seed(args.seed)
- if args.n_gpu > 0:
- torch.cuda.manual_seed_all(args.seed)
-
-
-def _sorted_checkpoints(args, checkpoint_prefix="checkpoint", use_mtime=False) -> List[str]:
- ordering_and_checkpoint_path = []
-
- glob_checkpoints = glob.glob(os.path.join(args.output_dir, "{}-*".format(checkpoint_prefix)))
-
- for path in glob_checkpoints:
- if use_mtime:
- ordering_and_checkpoint_path.append((os.path.getmtime(path), path))
- else:
- regex_match = re.match(".*{}-([0-9]+)".format(checkpoint_prefix), path)
- if regex_match and regex_match.groups():
- ordering_and_checkpoint_path.append((int(regex_match.groups()[0]), path))
-
- checkpoints_sorted = sorted(ordering_and_checkpoint_path)
- checkpoints_sorted = [checkpoint[1] for checkpoint in checkpoints_sorted]
- return checkpoints_sorted
-
-
-def _rotate_checkpoints(args, checkpoint_prefix="checkpoint", use_mtime=False) -> None:
- if not args.save_total_limit:
- return
- if args.save_total_limit <= 0:
- return
-
- # Check if we should delete older checkpoint(s)
- checkpoints_sorted = _sorted_checkpoints(args, checkpoint_prefix, use_mtime)
- if len(checkpoints_sorted) <= args.save_total_limit:
- return
-
- number_of_checkpoints_to_delete = max(0, len(checkpoints_sorted) - args.save_total_limit)
- checkpoints_to_be_deleted = checkpoints_sorted[:number_of_checkpoints_to_delete]
- for checkpoint in checkpoints_to_be_deleted:
- logger.info("Deleting older checkpoint [{}] due to args.save_total_limit".format(checkpoint))
- shutil.rmtree(checkpoint)
-
-
-def mask_tokens(inputs: torch.Tensor, tokenizer: PreTrainedTokenizer, args) -> Tuple[torch.Tensor, torch.Tensor]:
- """ Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original. """
-
- if tokenizer.mask_token is None:
- raise ValueError(
- "This tokenizer does not have a mask token which is necessary for masked language modeling. Remove the --mlm flag if you want to use this tokenizer."
- )
-
- labels = inputs.clone()
- # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa)
- probability_matrix = torch.full(labels.shape, args.mlm_probability)
- special_tokens_mask = [
- tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()
- ]
- probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)
- if tokenizer._pad_token is not None:
- padding_mask = labels.eq(tokenizer.pad_token_id)
- probability_matrix.masked_fill_(padding_mask, value=0.0)
- masked_indices = torch.bernoulli(probability_matrix).bool()
- labels[~masked_indices] = -100 # We only compute loss on masked tokens
-
- # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK])
- indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
- inputs[indices_replaced] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token)
-
- # 10% of the time, we replace masked input tokens with random word
- indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~indices_replaced
- random_words = torch.randint(len(tokenizer), labels.shape, dtype=torch.long)
- inputs[indices_random] = random_words[indices_random]
-
- # The rest of the time (10% of the time) we keep the masked input tokens unchanged
- return inputs, labels
-
-
-def train(args, train_dataset, model: PreTrainedModel, tokenizer: PreTrainedTokenizer) -> Tuple[int, float]:
- """ Train the model """
- if args.local_rank in [-1, 0]:
- tb_writer = SummaryWriter()
-
- args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)
-
- def collate(examples: List[torch.Tensor]):
- if tokenizer._pad_token is None:
- return pad_sequence(examples, batch_first=True)
- return pad_sequence(examples, batch_first=True, padding_value=tokenizer.pad_token_id)
-
- train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
- train_dataloader = DataLoader(
- train_dataset, sampler=train_sampler, batch_size=args.train_batch_size, collate_fn=collate
- )
-
- if args.max_steps > 0:
- t_total = args.max_steps
- args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1
- else:
- t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs
-
- model = model.module if hasattr(model, "module") else model # Take care of distributed/parallel training
- model.resize_token_embeddings(len(tokenizer))
-
- # Prepare optimizer and schedule (linear warmup and decay)
- no_decay = ["bias", "LayerNorm.weight"]
- optimizer_grouped_parameters = [
- {
- "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
- "weight_decay": args.weight_decay,
- },
- {"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], "weight_decay": 0.0},
- ]
- optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
- scheduler = get_linear_schedule_with_warmup(
- optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total
- )
-
- # Check if saved optimizer or scheduler states exist
- if (
- args.model_name_or_path
- and os.path.isfile(os.path.join(args.model_name_or_path, "optimizer.pt"))
- and os.path.isfile(os.path.join(args.model_name_or_path, "scheduler.pt"))
- ):
- # Load in optimizer and scheduler states
- optimizer.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "optimizer.pt")))
- scheduler.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "scheduler.pt")))
-
- if args.fp16:
- try:
- from apex import amp
- except ImportError:
- raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
- model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)
-
- # multi-gpu training (should be after apex fp16 initialization)
- if args.n_gpu > 1:
- model = torch.nn.DataParallel(model)
-
- # Distributed training (should be after apex fp16 initialization)
- if args.local_rank != -1:
- model = torch.nn.parallel.DistributedDataParallel(
- model, device_ids=[args.local_rank], output_device=args.local_rank, find_unused_parameters=True
- )
-
- # Train!
- logger.info("***** Running training *****")
- logger.info(" Num examples = %d", len(train_dataset))
- logger.info(" Num Epochs = %d", args.num_train_epochs)
- logger.info(" Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size)
- logger.info(
- " Total train batch size (w. parallel, distributed & accumulation) = %d",
- args.train_batch_size
- * args.gradient_accumulation_steps
- * (torch.distributed.get_world_size() if args.local_rank != -1 else 1),
- )
- logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
- logger.info(" Total optimization steps = %d", t_total)
-
- global_step = 0
- epochs_trained = 0
- steps_trained_in_current_epoch = 0
- # Check if continuing training from a checkpoint
- if args.model_name_or_path and os.path.exists(args.model_name_or_path):
- try:
- # set global_step to gobal_step of last saved checkpoint from model path
- checkpoint_suffix = args.model_name_or_path.split("-")[-1].split("/")[0]
- global_step = int(checkpoint_suffix)
- epochs_trained = global_step // (len(train_dataloader) // args.gradient_accumulation_steps)
- steps_trained_in_current_epoch = global_step % (len(train_dataloader) // args.gradient_accumulation_steps)
-
- logger.info(" Continuing training from checkpoint, will skip to saved global_step")
- logger.info(" Continuing training from epoch %d", epochs_trained)
- logger.info(" Continuing training from global step %d", global_step)
- logger.info(" Will skip the first %d steps in the first epoch", steps_trained_in_current_epoch)
- except ValueError:
- logger.info(" Starting fine-tuning.")
-
- tr_loss, logging_loss = 0.0, 0.0
-
- model.zero_grad()
- train_iterator = trange(
- epochs_trained, int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0]
- )
- set_seed(args) # Added here for reproducibility
- for epoch in train_iterator:
- epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])
-
- if args.local_rank != -1:
- train_sampler.set_epoch(epoch)
-
- for step, batch in enumerate(epoch_iterator):
-
- # Skip past any already trained steps if resuming training
- if steps_trained_in_current_epoch > 0:
- steps_trained_in_current_epoch -= 1
- continue
-
- inputs, labels = mask_tokens(batch, tokenizer, args) if args.mlm else (batch, batch)
- inputs = inputs.to(args.device)
- labels = labels.to(args.device)
- model.train()
- outputs = model(inputs, masked_lm_labels=labels) if args.mlm else model(inputs, labels=labels)
- loss = outputs[0] # model outputs are always tuple in transformers (see doc)
-
- if args.n_gpu > 1:
- loss = loss.mean() # mean() to average on multi-gpu parallel training
- if args.gradient_accumulation_steps > 1:
- loss = loss / args.gradient_accumulation_steps
-
- if args.fp16:
- with amp.scale_loss(loss, optimizer) as scaled_loss:
- scaled_loss.backward()
- else:
- loss.backward()
-
- tr_loss += loss.item()
- if (step + 1) % args.gradient_accumulation_steps == 0:
- if args.fp16:
- torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
- else:
- torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
- optimizer.step()
- scheduler.step() # Update learning rate schedule
- model.zero_grad()
- global_step += 1
-
- if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0:
- # Log metrics
- if (
- args.local_rank == -1 and args.evaluate_during_training
- ): # Only evaluate when single GPU otherwise metrics may not average well
- results = evaluate(args, model, tokenizer)
- for key, value in results.items():
- tb_writer.add_scalar("eval_{}".format(key), value, global_step)
- tb_writer.add_scalar("lr", scheduler.get_lr()[0], global_step)
- tb_writer.add_scalar("loss", (tr_loss - logging_loss) / args.logging_steps, global_step)
- logging_loss = tr_loss
-
- if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0:
- checkpoint_prefix = "checkpoint"
- # Save model checkpoint
- output_dir = os.path.join(args.output_dir, "{}-{}".format(checkpoint_prefix, global_step))
- os.makedirs(output_dir, exist_ok=True)
- model_to_save = (
- model.module if hasattr(model, "module") else model
- ) # Take care of distributed/parallel training
- model_to_save.save_pretrained(output_dir)
- tokenizer.save_pretrained(output_dir)
-
- torch.save(args, os.path.join(output_dir, "training_args.bin"))
- logger.info("Saving model checkpoint to %s", output_dir)
-
- _rotate_checkpoints(args, checkpoint_prefix)
-
- torch.save(optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
- torch.save(scheduler.state_dict(), os.path.join(output_dir, "scheduler.pt"))
- logger.info("Saving optimizer and scheduler states to %s", output_dir)
-
- if args.max_steps > 0 and global_step > args.max_steps:
- epoch_iterator.close()
- break
- if args.max_steps > 0 and global_step > args.max_steps:
- train_iterator.close()
- break
-
- if args.local_rank in [-1, 0]:
- tb_writer.close()
-
- return global_step, tr_loss / global_step
-
-
-def evaluate(args, model: PreTrainedModel, tokenizer: PreTrainedTokenizer, prefix="") -> Dict:
- # Loop to handle MNLI double evaluation (matched, mis-matched)
- eval_output_dir = args.output_dir
-
- eval_dataset = load_and_cache_examples(args, tokenizer, evaluate=True)
-
- if args.local_rank in [-1, 0]:
- os.makedirs(eval_output_dir, exist_ok=True)
-
- args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
- # Note that DistributedSampler samples randomly
-
- def collate(examples: List[torch.Tensor]):
- if tokenizer._pad_token is None:
- return pad_sequence(examples, batch_first=True)
- return pad_sequence(examples, batch_first=True, padding_value=tokenizer.pad_token_id)
-
- eval_sampler = SequentialSampler(eval_dataset)
- eval_dataloader = DataLoader(
- eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size, collate_fn=collate
- )
-
- # multi-gpu evaluate
- if args.n_gpu > 1:
- model = torch.nn.DataParallel(model)
-
- # Eval!
- logger.info("***** Running evaluation {} *****".format(prefix))
- logger.info(" Num examples = %d", len(eval_dataset))
- logger.info(" Batch size = %d", args.eval_batch_size)
- eval_loss = 0.0
- nb_eval_steps = 0
- model.eval()
-
- for batch in tqdm(eval_dataloader, desc="Evaluating"):
- inputs, labels = mask_tokens(batch, tokenizer, args) if args.mlm else (batch, batch)
- inputs = inputs.to(args.device)
- labels = labels.to(args.device)
-
- with torch.no_grad():
- outputs = model(inputs, masked_lm_labels=labels) if args.mlm else model(inputs, labels=labels)
- lm_loss = outputs[0]
- eval_loss += lm_loss.mean().item()
- nb_eval_steps += 1
-
- eval_loss = eval_loss / nb_eval_steps
- perplexity = torch.exp(torch.tensor(eval_loss))
-
- result = {"perplexity": perplexity}
-
- output_eval_file = os.path.join(eval_output_dir, prefix, "eval_results.txt")
- with open(output_eval_file, "w") as writer:
- logger.info("***** Eval results {} *****".format(prefix))
- for key in sorted(result.keys()):
- logger.info(" %s = %s", key, str(result[key]))
- writer.write("%s = %s\n" % (key, str(result[key])))
-
- return result
-
-
-def main():
- parser = argparse.ArgumentParser()
-
- # Required parameters
- parser.add_argument(
- "--train_data_file", default=None, type=str, required=True, help="The input training data file (a text file)."
- )
- parser.add_argument(
- "--output_dir",
- type=str,
- required=True,
- help="The output directory where the model predictions and checkpoints will be written.",
- )
- parser.add_argument(
- "--model_type", type=str, required=True, help="The model architecture to be trained or fine-tuned.",
- )
-
- # Other parameters
- parser.add_argument(
- "--eval_data_file",
- default=None,
- type=str,
- help="An optional input evaluation data file to evaluate the perplexity on (a text file).",
- )
- parser.add_argument(
- "--line_by_line",
- action="store_true",
- help="Whether distinct lines of text in the dataset are to be handled as distinct sequences.",
- )
- parser.add_argument(
- "--should_continue", action="store_true", help="Whether to continue from latest checkpoint in output_dir"
- )
- parser.add_argument(
- "--model_name_or_path",
- default=None,
- type=str,
- help="The model checkpoint for weights initialization. Leave None if you want to train a model from scratch.",
- )
-
- parser.add_argument(
- "--mlm", action="store_true", help="Train with masked-language modeling loss instead of language modeling."
- )
- parser.add_argument(
- "--mlm_probability", type=float, default=0.15, help="Ratio of tokens to mask for masked language modeling loss"
- )
-
- parser.add_argument(
- "--config_name",
- default=None,
- type=str,
- help="Optional pretrained config name or path if not the same as model_name_or_path. If both are None, initialize a new config.",
- )
- parser.add_argument(
- "--tokenizer_name",
- default=None,
- type=str,
- help="Optional pretrained tokenizer name or path if not the same as model_name_or_path. If both are None, initialize a new tokenizer.",
- )
- parser.add_argument(
- "--cache_dir",
- default=None,
- type=str,
- help="Optional directory to store the pre-trained models downloaded from s3 (instead of the default one)",
- )
- parser.add_argument(
- "--block_size",
- default=-1,
- type=int,
- help="Optional input sequence length after tokenization."
- "The training dataset will be truncated in block of this size for training."
- "Default to the model max input length for single sentence inputs (take into account special tokens).",
- )
- parser.add_argument("--do_train", action="store_true", help="Whether to run training.")
- parser.add_argument("--do_eval", action="store_true", help="Whether to run eval on the dev set.")
- parser.add_argument(
- "--evaluate_during_training", action="store_true", help="Run evaluation during training at each logging step."
- )
-
- parser.add_argument("--per_gpu_train_batch_size", default=4, type=int, help="Batch size per GPU/CPU for training.")
- parser.add_argument(
- "--per_gpu_eval_batch_size", default=4, type=int, help="Batch size per GPU/CPU for evaluation."
- )
- parser.add_argument(
- "--gradient_accumulation_steps",
- type=int,
- default=1,
- help="Number of updates steps to accumulate before performing a backward/update pass.",
- )
- parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.")
- parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.")
- parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.")
- parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.")
- parser.add_argument(
- "--num_train_epochs", default=1.0, type=float, help="Total number of training epochs to perform."
- )
- parser.add_argument(
- "--max_steps",
- default=-1,
- type=int,
- help="If > 0: set total number of training steps to perform. Override num_train_epochs.",
- )
- parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.")
-
- parser.add_argument("--logging_steps", type=int, default=500, help="Log every X updates steps.")
- parser.add_argument("--save_steps", type=int, default=500, help="Save checkpoint every X updates steps.")
- parser.add_argument(
- "--save_total_limit",
- type=int,
- default=None,
- help="Limit the total amount of checkpoints, delete the older checkpoints in the output_dir, does not delete by default",
- )
- parser.add_argument(
- "--eval_all_checkpoints",
- action="store_true",
- help="Evaluate all checkpoints starting with the same prefix as model_name_or_path ending and ending with step number",
- )
- parser.add_argument("--no_cuda", action="store_true", help="Avoid using CUDA when available")
- parser.add_argument(
- "--overwrite_output_dir", action="store_true", help="Overwrite the content of the output directory"
- )
- parser.add_argument(
- "--overwrite_cache", action="store_true", help="Overwrite the cached training and evaluation sets"
- )
- parser.add_argument("--seed", type=int, default=42, help="random seed for initialization")
-
- parser.add_argument(
- "--fp16",
- action="store_true",
- help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit",
- )
- parser.add_argument(
- "--fp16_opt_level",
- type=str,
- default="O1",
- help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']."
- "See details at https://nvidia.github.io/apex/amp.html",
- )
- parser.add_argument("--local_rank", type=int, default=-1, help="For distributed training: local_rank")
- parser.add_argument("--server_ip", type=str, default="", help="For distant debugging.")
- parser.add_argument("--server_port", type=str, default="", help="For distant debugging.")
- args = parser.parse_args()
-
- if args.model_type in ["bert", "roberta", "distilbert", "camembert"] and not args.mlm:
- raise ValueError(
- "BERT and RoBERTa-like models do not have LM heads but masked LM heads. They must be run using the --mlm "
- "flag (masked language modeling)."
- )
- if args.eval_data_file is None and args.do_eval:
- raise ValueError(
- "Cannot do evaluation without an evaluation data file. Either supply a file to --eval_data_file "
- "or remove the --do_eval argument."
- )
- if args.should_continue:
- sorted_checkpoints = _sorted_checkpoints(args)
- if len(sorted_checkpoints) == 0:
- raise ValueError("Used --should_continue but no checkpoint was found in --output_dir.")
- else:
- args.model_name_or_path = sorted_checkpoints[-1]
-
- if (
- os.path.exists(args.output_dir)
- and os.listdir(args.output_dir)
- and args.do_train
- and not args.overwrite_output_dir
- and not args.should_continue
- ):
- raise ValueError(
- "Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.".format(
- args.output_dir
- )
- )
-
- # Setup distant debugging if needed
- if args.server_ip and args.server_port:
- # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script
- import ptvsd
-
- print("Waiting for debugger attach")
- ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True)
- ptvsd.wait_for_attach()
-
- # Setup CUDA, GPU & distributed training
- if args.local_rank == -1 or args.no_cuda:
- device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
- args.n_gpu = 0 if args.no_cuda else torch.cuda.device_count()
- else: # Initializes the distributed backend which will take care of sychronizing nodes/GPUs
- torch.cuda.set_device(args.local_rank)
- device = torch.device("cuda", args.local_rank)
- torch.distributed.init_process_group(backend="nccl")
- args.n_gpu = 1
- args.device = device
-
- # Setup logging
- logging.basicConfig(
- format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
- datefmt="%m/%d/%Y %H:%M:%S",
- level=logging.INFO if args.local_rank in [-1, 0] else logging.WARN,
- )
- logger.warning(
- "Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
- args.local_rank,
- device,
- args.n_gpu,
- bool(args.local_rank != -1),
- args.fp16,
- )
-
- # Set seed
- set_seed(args)
-
- # Load pretrained model and tokenizer
- if args.local_rank not in [-1, 0]:
- torch.distributed.barrier() # Barrier to make sure only the first process in distributed training download model & vocab
-
- if args.config_name:
- config = AutoConfig.from_pretrained(args.config_name, cache_dir=args.cache_dir)
- elif args.model_name_or_path:
- config = AutoConfig.from_pretrained(args.model_name_or_path, cache_dir=args.cache_dir)
- else:
- # When we release a pip version exposing CONFIG_MAPPING,
- # we can do `config = CONFIG_MAPPING[args.model_type]()`.
- raise ValueError(
- "You are instantiating a new config instance from scratch. This is not supported, but you can do it from another script, save it,"
- "and load it from here, using --config_name"
- )
-
- if args.tokenizer_name:
- tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_name, cache_dir=args.cache_dir)
- elif args.model_name_or_path:
- tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, cache_dir=args.cache_dir)
- else:
- raise ValueError(
- "You are instantiating a new tokenizer from scratch. This is not supported, but you can do it from another script, save it,"
- "and load it from here, using --tokenizer_name"
- )
-
- if args.block_size <= 0:
- args.block_size = tokenizer.max_len
- # Our input block size will be the max possible for the model
- else:
- args.block_size = min(args.block_size, tokenizer.max_len)
-
- if args.model_name_or_path:
- model = AutoModelWithLMHead.from_pretrained(
- args.model_name_or_path,
- from_tf=bool(".ckpt" in args.model_name_or_path),
- config=config,
- cache_dir=args.cache_dir,
- )
- else:
- logger.info("Training new model from scratch")
- model = AutoModelWithLMHead.from_config(config)
-
- model.to(args.device)
-
- if args.local_rank == 0:
- torch.distributed.barrier() # End of barrier to make sure only the first process in distributed training download model & vocab
-
- logger.info("Training/evaluation parameters %s", args)
-
- # Training
- if args.do_train:
- if args.local_rank not in [-1, 0]:
- torch.distributed.barrier() # Barrier to make sure only the first process in distributed training process the dataset, and the others will use the cache
-
- train_dataset = load_and_cache_examples(args, tokenizer, evaluate=False)
-
- if args.local_rank == 0:
- torch.distributed.barrier()
-
- global_step, tr_loss = train(args, train_dataset, model, tokenizer)
- logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)
-
- # Saving best-practices: if you use save_pretrained for the model and tokenizer, you can reload them using from_pretrained()
- if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0):
- # Create output directory if needed
- if args.local_rank in [-1, 0]:
- os.makedirs(args.output_dir, exist_ok=True)
-
- logger.info("Saving model checkpoint to %s", args.output_dir)
- # Save a trained model, configuration and tokenizer using `save_pretrained()`.
- # They can then be reloaded using `from_pretrained()`
- model_to_save = (
- model.module if hasattr(model, "module") else model
- ) # Take care of distributed/parallel training
- model_to_save.save_pretrained(args.output_dir)
- tokenizer.save_pretrained(args.output_dir)
-
- # Good practice: save your training arguments together with the trained model
- torch.save(args, os.path.join(args.output_dir, "training_args.bin"))
-
- # Load a trained model and vocabulary that you have fine-tuned
- model = AutoModelWithLMHead.from_pretrained(args.output_dir)
- tokenizer = AutoTokenizer.from_pretrained(args.output_dir)
- model.to(args.device)
-
- # Evaluation
- results = {}
- if args.do_eval and args.local_rank in [-1, 0]:
- checkpoints = [args.output_dir]
- if args.eval_all_checkpoints:
- checkpoints = list(
- os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + "/**/" + WEIGHTS_NAME, recursive=True))
- )
- logging.getLogger("transformers.modeling_utils").setLevel(logging.WARN) # Reduce logging
- logger.info("Evaluate the following checkpoints: %s", checkpoints)
- for checkpoint in checkpoints:
- global_step = checkpoint.split("-")[-1] if len(checkpoints) > 1 else ""
- prefix = checkpoint.split("/")[-1] if checkpoint.find("checkpoint") != -1 else ""
-
- model = AutoModelWithLMHead.from_pretrained(checkpoint)
- model.to(args.device)
- result = evaluate(args, model, tokenizer, prefix=prefix)
- result = dict((k + "_{}".format(global_step), v) for k, v in result.items())
- results.update(result)
-
- return results
-
-
-if __name__ == "__main__":
- main()
diff --git a/train/run_lm_finetuning.py b/train/run_lm_finetuning.py
new file mode 100644
index 0000000..1f6ce7e
--- /dev/null
+++ b/train/run_lm_finetuning.py
@@ -0,0 +1,662 @@
+# coding=utf-8
+# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
+# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+Fine-tuning the library models for language modeling on a text file (GPT, GPT-2, BERT, RoBERTa).
+GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa are fine-tuned
+using a masked language modeling (MLM) loss.
+"""
+
+from __future__ import absolute_import, division, print_function
+
+import argparse
+import glob
+import logging
+import os
+import pickle
+import random
+import regex as re
+import shutil
+
+import numpy as np
+import torch
+from torch.utils.data import DataLoader, Dataset, RandomSampler, SequentialSampler
+from torch.utils.data.distributed import DistributedSampler
+
+try:
+ from torch.utils.tensorboard import SummaryWriter
+except ImportError:
+ from tensorboardX import SummaryWriter
+
+from tqdm import tqdm, trange
+from dataclasses import dataclass
+from fastprogress import progress_bar
+from fastai.basics import *
+
+from run_generation import sample_sequence
+
+from transformers import (WEIGHTS_NAME, AdamW, get_linear_schedule_with_warmup, get_constant_schedule_with_warmup, get_cosine_schedule_with_warmup,
+ BertConfig, BertForMaskedLM, BertTokenizer,
+ GPT2Config, GPT2LMHeadModel, GPT2Tokenizer,
+ OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer,
+ RobertaConfig, RobertaForMaskedLM, RobertaTokenizer,
+ DistilBertConfig, DistilBertForMaskedLM, DistilBertTokenizer)
+
+from yt_encoder import YTEncoder
+
+logger = logging.getLogger(__name__)
+
+
+MODEL_CLASSES = {
+ 'gpt2': (GPT2Config, GPT2LMHeadModel, GPT2Tokenizer),
+ 'gpt2-yttm': (GPT2Config, GPT2LMHeadModel, YTEncoder),
+ 'openai-gpt': (OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer),
+ 'bert': (BertConfig, BertForMaskedLM, BertTokenizer),
+ 'roberta': (RobertaConfig, RobertaForMaskedLM, RobertaTokenizer),
+ 'distilbert': (DistilBertConfig, DistilBertForMaskedLM, DistilBertTokenizer)
+}
+
+@dataclass
+class MovingLoss():
+ steps:int = 1000
+ avg_loss = (0.0, 0.0)
+ def add(self, batch_loss:float):
+ k_s = 1 - 1 / self.steps
+ avg_loss = self.avg_loss
+ self.avg_loss = (self.avg_loss[0] * k_s + batch_loss * (1-k_s),
+ self.avg_loss[1] * k_s + 1.0 * (1-k_s))
+ @property
+ def loss(self):
+ if self.avg_loss[1]:
+ return self.avg_loss[0] / self.avg_loss[1]
+
+def print_sample(model, tokenizer, device, args):
+ model.eval()
+ raw_text = """ На словах ты Лев Толстой,\n А на деле -"""
+ context_tokens = tokenizer.encode(raw_text)
+ out = sample_sequence(
+ model=model,
+ context=context_tokens,
+ length=500,
+ temperature=1,
+ top_k=0,
+ top_p=0.9,
+ device=device,
+ #is_xlnet=bool(args.model_type == "xlnet"),
+ )
+ out = out[0, len(context_tokens):].tolist()
+ text = raw_text + tokenizer.decode(out)
+ print(text)
+
+ with open(os.path.join(args.output_dir, 'sample.txt'), 'w') as f:
+ f.write(text)
+
+ model.train()
+
+class TextDataset(Dataset):
+ @staticmethod
+ def process_file(file_path, tokenizer, block_size, shuffle):
+ directory, filename = os.path.split(file_path)
+ directory = os.path.join(directory, 'cached')
+ os.makedirs(directory, exist_ok=True)
+        cached_features_file = os.path.join(directory, f'cached_lm_{block_size}_{tokenizer.hash}_{filename}')
+
+ if os.path.exists(cached_features_file):
+ with open(cached_features_file, 'rb') as handle:
+ tokenized_text = pickle.load(handle)
+ else:
+ with open(file_path, encoding="utf-8") as f:
+ text = f.read()
+ if hasattr(tokenizer, 'encode'):
+ tokenized_text = tokenizer.encode(text)
+ else:
+ tokenized_text = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))
+ with open(cached_features_file, 'wb') as handle:
+ pickle.dump(tokenized_text, handle, protocol=pickle.HIGHEST_PROTOCOL)
+
+ examples = []
+ # add random shift
+ max_shift = max(min(block_size, len(tokenized_text) - block_size), 0)
+ rnd_shift = random.randrange(max_shift) if max_shift and shuffle else 0
+
+ for i in range(rnd_shift, len(tokenized_text)-block_size+1, block_size):
+ examples.append(tokenizer.build_inputs_with_special_tokens(tokenized_text[i:i+block_size]))
+        # Note that we are losing the last truncated example here for the sake of simplicity (no padding).
+        # If your dataset is small, first you should look for a bigger one :-), and second you
+ # can change this behavior by adding (model specific) padding.
+ return examples
+
+ def __init__(self, tokenizer, file_path='train', args=None, shuffle=True):
+ if not hasattr(tokenizer, 'hash'): tokenizer.hash = ''
+
+ logger.info(f"Loading features from {file_path}")
+ if os.path.isfile(file_path):
+ files = [file_path]
+ else:
+ assert os.path.isdir(file_path)
+ files = glob.glob(os.path.join(file_path, '*.txt'))
+
+ files = sorted(files)
+ if shuffle: random.shuffle(files)
+
+ files = files[:10000]
+
+ self.examples = []
+ for fn in progress_bar(files):
+ self.examples.extend(self.process_file(fn, tokenizer, args.block_size, shuffle))
+
+ def __len__(self):
+ return len(self.examples)
+
+ def __getitem__(self, item):
+ return torch.tensor(self.examples[item])
+
+
+def load_and_cache_examples(args, tokenizer, evaluate=False):
+ file_path = args.eval_data_file if evaluate else args.train_data_file
+ dataset = TextDataset(tokenizer, file_path=file_path, args=args, shuffle=not evaluate)
+ return dataset
+
+
+def set_seed(args):
+ random.seed(args.seed)
+ np.random.seed(args.seed)
+ torch.manual_seed(args.seed)
+ if args.n_gpu > 0:
+ torch.cuda.manual_seed_all(args.seed)
+
+def _rotate_checkpoints(args, checkpoint_prefix, use_mtime=False):
+ if not args.save_total_limit:
+ return
+ if args.save_total_limit <= 0:
+ return
+
+ # Check if we should delete older checkpoint(s)
+ glob_checkpoints = glob.glob(os.path.join(args.output_dir, '{}-*'.format(checkpoint_prefix)))
+ if len(glob_checkpoints) <= args.save_total_limit:
+ return
+
+ ordering_and_checkpoint_path = []
+ for path in glob_checkpoints:
+ if use_mtime:
+ ordering_and_checkpoint_path.append((os.path.getmtime(path), path))
+ else:
+ regex_match = re.match('.*{}-([0-9]+)'.format(checkpoint_prefix), path)
+ if regex_match and regex_match.groups():
+ ordering_and_checkpoint_path.append((int(regex_match.groups()[0]), path))
+
+ checkpoints_sorted = sorted(ordering_and_checkpoint_path)
+ checkpoints_sorted = [checkpoint[1] for checkpoint in checkpoints_sorted]
+ number_of_checkpoints_to_delete = max(0, len(checkpoints_sorted) - args.save_total_limit)
+ checkpoints_to_be_deleted = checkpoints_sorted[:number_of_checkpoints_to_delete]
+ for checkpoint in checkpoints_to_be_deleted:
+ logger.info("Deleting older checkpoint [{}] due to args.save_total_limit".format(checkpoint))
+ shutil.rmtree(checkpoint)
+
+
+def mask_tokens(inputs, tokenizer, args):
+ """ Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original. """
+ labels = inputs.clone()
+ # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa)
+ probability_matrix = torch.full(labels.shape, args.mlm_probability)
+ special_tokens_mask = [tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()]
+ probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)
+ masked_indices = torch.bernoulli(probability_matrix).bool()
+ labels[~masked_indices] = -1 # We only compute loss on masked tokens
+
+ # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK])
+ indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
+ inputs[indices_replaced] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token)
+
+ # 10% of the time, we replace masked input tokens with random word
+ indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~indices_replaced
+ random_words = torch.randint(len(tokenizer), labels.shape, dtype=torch.long)
+ inputs[indices_random] = random_words[indices_random]
+
+ # The rest of the time (10% of the time) we keep the masked input tokens unchanged
+ return inputs, labels
+
+def save_state(args, model, tokenizer, global_step):
+ def save_dir(output_dir):
+ # Create output directory if needed
+ if not os.path.exists(output_dir) and args.local_rank in [-1, 0]:
+ os.makedirs(output_dir)
+ logger.info(f"Saving model checkpoint to {output_dir}")
+ # Save a trained model, configuration and tokenizer using `save_pretrained()`.
+ # They can then be reloaded using `from_pretrained()`
+ model_to_save = model.module if hasattr(model, 'module') else model # Take care of distributed/parallel training
+ model_to_save.save_pretrained(output_dir)
+ tokenizer.save_pretrained(output_dir)
+
+ # Good practice: save your training arguments together with the trained model
+ torch.save(args, os.path.join(output_dir, 'training_args.bin'))
+ with open(os.path.join(output_dir, 'step.txt'), 'w') as c: c.write(str(global_step))
+
+ save_dir(args.output_dir)
+ checkpoint_prefix = 'checkpoint'
+ output_dir = os.path.join(args.output_dir, f'{checkpoint_prefix}-{global_step}')
+ save_dir(output_dir)
+ _rotate_checkpoints(args, checkpoint_prefix)
+
+class SummaryWriterP(SummaryWriter):
+ def __init__(self, prefix=None, logdir=None, comment='', *args, **kwargs):
+ if prefix:
+ import socket
+ from datetime import datetime
+ current_time = datetime.now().strftime('%b%d_%H-%M-%S')
+ logdir = os.path.join(prefix,
+ 'runs', current_time + '_' + socket.gethostname() + comment)
+ super().__init__(logdir, comment, *args, **kwargs)
+
+def train(args, train_dataset, model, tokenizer):
+ """ Train the model """
+ if args.local_rank in [-1, 0]:
+ tb_writer = SummaryWriterP(args.output_dir)
+
+ # XXX: Collate function
+ args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)
+ train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
+ train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size)
+
+ if args.max_steps > 0:
+ t_total = args.max_steps
+ args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1
+ else:
+ t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs
+
+ # Prepare optimizer and schedule (linear warmup and decay)
+ no_decay = ['bias', 'LayerNorm.weight']
+ optimizer_grouped_parameters = [
+ {'params': [p for n, p in model.named_parameters() if p.requires_grad and not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay},
+ {'params': [p for n, p in model.named_parameters() if p.requires_grad and any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
+ ]
+ optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
+ warmup_steps = args.warmup_samples // args.train_batch_size
+ if args.lr_decay:
+ scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=warmup_steps, num_training_steps=t_total)
+ else:
+ scheduler = get_constant_schedule_with_warmup(optimizer, num_warmup_steps=warmup_steps)
+
+ # XXX: No check if saved optimizer or scheduler states exist
+
+ if args.fp16:
+ try:
+ from apex import amp
+ except ImportError:
+ raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
+ model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)
+
+ # multi-gpu training (should be after apex fp16 initialization)
+ if args.n_gpu > 1:
+ model = torch.nn.DataParallel(model)
+
+ # Distributed training (should be after apex fp16 initialization)
+ if args.local_rank != -1:
+ model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank],
+ output_device=args.local_rank,
+ find_unused_parameters=True)
+
+ # Train!
+ logger.info("***** Running training *****")
+ logger.info(" Num examples = %d", len(train_dataset))
+ logger.info(" Num Epochs = %d", args.num_train_epochs)
+ logger.info(" Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size)
+ logger.info(" Total train batch size (w. parallel, distributed & accumulation) = %d",
+ args.train_batch_size * args.gradient_accumulation_steps * (torch.distributed.get_world_size() if args.local_rank != -1 else 1))
+ logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
+ logger.info(" Total optimization steps = %d", t_total)
+
+ try:
+ with open(os.path.join(args.model_name_or_path, 'step.txt'), 'r') as c:
+ global_step = int(c.readline())
+    except OSError:
+ global_step = 0
+
+ tr_loss, logging_loss = 0.0, 0.0
+ moving_loss = MovingLoss(10000 // args.logging_steps)
+ model.zero_grad()
+
+ train_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0])
+ set_seed(args) # Added here for reproducibility (even between python 2 and 3)
+ try:
+ for _ in train_iterator:
+ epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])
+ for step, batch in enumerate(epoch_iterator):
+ inputs, labels = mask_tokens(batch, tokenizer, args) if args.mlm else (batch, batch)
+ inputs = inputs.to(args.device)
+ labels = labels.to(args.device)
+ model.train()
+ outputs = model(inputs, masked_lm_labels=labels) if args.mlm else model(inputs, labels=labels)
+ loss = outputs[0] # model outputs are always tuple in pytorch-transformers (see doc)
+
+ if args.n_gpu > 1:
+ loss = loss.mean() # mean() to average on multi-gpu parallel training
+ if args.gradient_accumulation_steps > 1:
+ loss = loss / args.gradient_accumulation_steps
+
+ if args.fp16:
+ with amp.scale_loss(loss, optimizer) as scaled_loss:
+ scaled_loss.backward()
+ else:
+ loss.backward()
+
+ tr_loss += loss.item()
+ moving_loss.add(loss.item())
+ if (step + 1) % args.gradient_accumulation_steps == 0:
+ if args.fp16:
+ torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
+ else:
+ torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
+ optimizer.step()
+ scheduler.step() # Update learning rate schedule
+ model.zero_grad()
+ global_step += 1
+
+ # Log metrics
+ if args.local_rank == -1 and args.evaluate_during_training and global_step % args.eval_steps == 0: # Only evaluate when single GPU otherwise metrics may not average well
+ results = evaluate(args, model, tokenizer, f"checkpoint-{global_step}")
+ for key, value in results.items():
+ tb_writer.add_scalar('eval_{}'.format(key), value, global_step)
+
+ if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0:
+ tb_writer.add_scalar('lr', scheduler.get_lr()[0], global_step)
+ tb_writer.add_scalar('loss', (tr_loss - logging_loss)/args.logging_steps, global_step)
+ logging_loss = tr_loss
+ epoch_iterator.set_postfix(MovingLoss=f'{moving_loss.loss:.2f}', Perplexity=f'{torch.exp(torch.tensor(moving_loss.loss)):.2f}')
+
+ if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0:
+ # Save model checkpoint
+ save_state(args, model, tokenizer, global_step)
+
+ if args.max_steps > 0 and global_step > args.max_steps:
+ epoch_iterator.close()
+ break
+ print_sample(model, tokenizer, args.device, args)
+ if args.max_steps > 0 and global_step > args.max_steps:
+ train_iterator.close()
+ break
+ except (KeyboardInterrupt, SystemExit):
+ save_state(args, model, tokenizer, global_step)
+ raise
+
+ if args.local_rank in [-1, 0]:
+ tb_writer.close()
+
+ return global_step, tr_loss / global_step
+
+
+def evaluate(args, model, tokenizer, prefix=""):
+ # Loop to handle MNLI double evaluation (matched, mis-matched)
+ eval_output_dir = args.output_dir
+
+ eval_dataset = load_and_cache_examples(args, tokenizer, evaluate=True)
+
+ if not os.path.exists(eval_output_dir) and args.local_rank in [-1, 0]:
+ os.makedirs(eval_output_dir)
+
+ args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
+ # Note that DistributedSampler samples randomly
+ eval_sampler = SequentialSampler(eval_dataset) if args.local_rank == -1 else DistributedSampler(eval_dataset)
+ eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size)
+
+ # Eval!
+ logger.info("***** Running evaluation {} *****".format(prefix))
+ logger.info(" Num examples = %d", len(eval_dataset))
+ logger.info(" Batch size = %d", args.eval_batch_size)
+ eval_loss = 0.0
+ nb_eval_steps = 0
+ model.eval()
+
+ for batch in tqdm(eval_dataloader, desc="Evaluating"):
+ batch = batch.to(args.device)
+
+ with torch.no_grad():
+ outputs = model(batch, masked_lm_labels=batch) if args.mlm else model(batch, labels=batch)
+ lm_loss = outputs[0]
+ eval_loss += lm_loss.item() #lm_loss.mean().item()
+ nb_eval_steps += 1
+
+ eval_loss = eval_loss / nb_eval_steps
+ perplexity = torch.exp(torch.tensor(eval_loss))
+
+ result = {
+ "perplexity": perplexity
+ }
+
+ output_eval_file = os.path.join(eval_output_dir, "eval_results.txt")
+ with open(output_eval_file, "w") as writer:
+ logger.info("***** Eval results {} *****".format(prefix))
+ for key in sorted(result.keys()):
+ logger.info(" %s = %s", key, str(result[key]))
+ writer.write("%s = %s\n" % (key, str(result[key])))
+
+ return result
+
+
+def main():
+ parser = argparse.ArgumentParser()
+
+ ## Required parameters
+ parser.add_argument("--train_data_file", default=None, type=str, required=True,
+ help="The input training data file (a text file).")
+ parser.add_argument("--output_dir", default=None, type=str, required=True,
+ help="The output directory where the model predictions and checkpoints will be written.")
+
+ ## Other parameters
+ parser.add_argument("--eval_data_file", default=None, type=str,
+ help="An optional input evaluation data file to evaluate the perplexity on (a text file).")
+
+ parser.add_argument("--model_type", default="bert", type=str,
+ help="The model architecture to be fine-tuned.")
+ parser.add_argument("--model_name_or_path", default="bert-base-cased", type=str,
+ help="The model checkpoint for weights initialization.")
+
+ parser.add_argument("--mlm", action='store_true',
+ help="Train with masked-language modeling loss instead of language modeling.")
+ parser.add_argument("--mlm_probability", type=float, default=0.15,
+ help="Ratio of tokens to mask for masked language modeling loss")
+
+ parser.add_argument("--config_name", default="", type=str,
+ help="Optional pretrained config name or path if not the same as model_name_or_path")
+ parser.add_argument("--tokenizer_name", default="", type=str,
+ help="Optional pretrained tokenizer name or path if not the same as model_name_or_path")
+ parser.add_argument("--cache_dir", default="", type=str,
+ help="Optional directory to store the pre-trained models downloaded from s3 (instead of the default one)")
+ parser.add_argument("--block_size", default=-1, type=int,
+ help="Optional input sequence length after tokenization."
+ "The training dataset will be truncated into blocks of this size for training."
+ "Defaults to the model's max input length for single-sentence inputs (accounting for special tokens).")
+ parser.add_argument("--do_train", action='store_true',
+ help="Whether to run training.")
+ parser.add_argument("--do_eval", action='store_true',
+ help="Whether to run eval on the dev set.")
+ parser.add_argument("--evaluate_during_training", action='store_true',
+ help="Run evaluation during training at each logging step.")
+ parser.add_argument('--eval_steps', type=int, default=100,
+ help="Evaluate every X update steps.")
+ parser.add_argument("--do_lower_case", action='store_true',
+ help="Set this flag if you are using an uncased model.")
+
+ parser.add_argument("--per_gpu_train_batch_size", default=4, type=int,
+ help="Batch size per GPU/CPU for training.")
+ parser.add_argument("--per_gpu_eval_batch_size", default=4, type=int,
+ help="Batch size per GPU/CPU for evaluation.")
+ parser.add_argument('--gradient_accumulation_steps', type=int, default=1,
+ help="Number of update steps to accumulate before performing a backward/update pass.")
+ parser.add_argument("--learning_rate", default=5e-5, type=float,
+ help="The initial learning rate for Adam.")
+ parser.add_argument("--weight_decay", default=0.0, type=float,
+ help="Weight decay to apply, if any.")
+ parser.add_argument("--adam_epsilon", default=1e-6, type=float,
+ help="Epsilon for Adam optimizer.")
+ parser.add_argument("--max_grad_norm", default=1.0, type=float,
+ help="Max gradient norm.")
+ parser.add_argument("--num_train_epochs", default=1.0, type=float,
+ help="Total number of training epochs to perform.")
+ parser.add_argument("--max_steps", default=-1, type=int,
+ help="If > 0: set total number of training steps to perform. Override num_train_epochs.")
+ parser.add_argument("--warmup_samples", default=0, type=int,
+ help="Linear warmup over warmup_samples.")
+ parser.add_argument("--lr_decay", action='store_true',
+ help="Decay LR using get_linear_schedule_with_warmup.")
+
+ parser.add_argument("--unfreeze_level", default=-1, type=int,
+ help="If >= 0: freeze all layers except the first and last few.")
+
+ parser.add_argument('--logging_steps', type=int, default=50,
+ help="Log every X update steps.")
+ parser.add_argument('--save_steps', type=int, default=50,
+ help="Save checkpoint every X update steps.")
+ parser.add_argument('--save_total_limit', type=int, default=None,
+ help='Limit the total number of checkpoints; older checkpoints in output_dir are deleted. No deletion by default.')
+ parser.add_argument("--eval_all_checkpoints", action='store_true',
+ help="Evaluate all checkpoints starting with the same prefix as model_name_or_path and ending with a step number")
+ parser.add_argument("--no_cuda", action='store_true',
+ help="Avoid using CUDA when available")
+ parser.add_argument('--overwrite_output_dir', action='store_true',
+ help="Overwrite the content of the output directory")
+ parser.add_argument('--overwrite_cache', action='store_true',
+ help="Overwrite the cached training and evaluation sets")
+ parser.add_argument('--seed', type=int, default=42,
+ help="random seed for initialization")
+
+ parser.add_argument('--fp16', action='store_true',
+ help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit")
+ parser.add_argument('--fp16_opt_level', type=str, default='O1',
+ help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']."
+ "See details at https://nvidia.github.io/apex/amp.html")
+ parser.add_argument("--local_rank", type=int, default=-1,
+ help="For distributed training: local_rank")
+ parser.add_argument('--server_ip', type=str, default='', help="For distant debugging.")
+ parser.add_argument('--server_port', type=str, default='', help="For distant debugging.")
+ args = parser.parse_args()
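The `--warmup_samples` / `--lr_decay` pair above describes a linear warmup followed by linear decay. A minimal sketch of that schedule shape in pure Python (this mirrors the multiplier produced by `get_linear_schedule_with_warmup`, not its actual implementation; all names here are illustrative):

```python
def linear_warmup_decay(step, warmup_steps, total_steps):
    """LR multiplier: ramps 0 -> 1 over warmup_steps, then decays linearly to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

# The multiplier rises during warmup, peaks at 1.0, then falls toward 0.
sched = [round(linear_warmup_decay(s, warmup_steps=2, total_steps=10), 2) for s in range(11)]
print(sched)
```

The warmup phase keeps early updates small while optimizer statistics stabilize; the decay phase shrinks the step size as training converges.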
+
+ if args.model_type in ["bert", "roberta", "distilbert"] and not args.mlm:
+ raise ValueError("BERT, RoBERTa and DistilBERT have masked LM heads rather than causal LM heads. They must be run "
+ "using the --mlm flag (masked language modeling).")
+ if args.eval_data_file is None and args.do_eval:
+ raise ValueError("Cannot do evaluation without an evaluation data file. Either supply a file to --eval_data_file "
+ "or remove the --do_eval argument.")
+
+ if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train and not args.overwrite_output_dir:
+ raise ValueError(f"Output directory ({args.output_dir}) already exists and is not empty. Use --overwrite_output_dir to overcome.")
+
+ # Setup distant debugging if needed
+ if args.server_ip and args.server_port:
+ # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script
+ import ptvsd
+ print("Waiting for debugger attach")
+ ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True)
+ ptvsd.wait_for_attach()
+
+ # Setup CUDA, GPU & distributed training
+ if args.local_rank == -1 or args.no_cuda:
+ device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
+ args.n_gpu = torch.cuda.device_count()
+ else: # Initializes the distributed backend which will take care of synchronizing nodes/GPUs
+ torch.cuda.set_device(args.local_rank)
+ device = torch.device("cuda", args.local_rank)
+ torch.distributed.init_process_group(backend='nccl')
+ args.n_gpu = 1
+ args.device = device
+
+ # Setup logging
+ logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
+ datefmt = '%m/%d/%Y %H:%M:%S',
+ level = logging.INFO if args.local_rank in [-1, 0] else logging.WARN)
+ logger.warning("Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
+ args.local_rank, args.device, args.n_gpu, bool(args.local_rank != -1), args.fp16)
+
+ # Set seed
+ set_seed(args)
+
+ # Load pretrained model and tokenizer
+ if args.local_rank not in [-1, 0]:
+ torch.distributed.barrier() # Barrier to make sure only the first process in distributed training downloads the model & vocab
+
+ config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
+ config = config_class.from_pretrained(args.config_name if args.config_name else args.model_name_or_path)
+ tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name if args.tokenizer_name else args.model_name_or_path, do_lower_case=args.do_lower_case)
+ if args.block_size <= 0:
+ args.block_size = tokenizer.max_len_single_sentence # Our input block size will be the max possible for the model
+ args.block_size = min(args.block_size, tokenizer.max_len_single_sentence)
+ model = model_class.from_pretrained(args.model_name_or_path, from_tf=bool('.ckpt' in args.model_name_or_path), config=config)
+ model.to(args.device)
+
+ # Print the number of trainable parameters before (and after) optional freezing
+ print(200*'/')
+ print(len([param for item in flatten_model(model)
+ for param in item.parameters()
+ if param.requires_grad]))
+ # Freeze all layers except the first and last few, sized by unfreeze_level
+ if args.unfreeze_level >= 0:
+ flat = flatten_model(model)
+ flat = [item for item in flat if list(item.parameters())]
+ i_start = 3
+ i_end = 1
+ need_grads = set(flat[:i_start+args.unfreeze_level*3]) | set(flat[-(i_end+args.unfreeze_level*3):])
+ for item in flat:
+ requires_grad(item, item in need_grads)
+ print(200*'/')
+ print(len([param for item in flatten_model(model)
+ for param in item.parameters()
+ if param.requires_grad]))
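The slicing above keeps gradients only for the first `i_start + unfreeze_level*3` and last `i_end + unfreeze_level*3` flattened modules. A small sketch of that selection logic on a plain list (no torch; the layer names are placeholders):

```python
def select_unfrozen(flat, unfreeze_level, i_start=3, i_end=1):
    """Mirror the need_grads selection: keep a head and a tail of the
    flattened module list, sized by unfreeze_level; everything else freezes."""
    head = flat[:i_start + unfreeze_level * 3]
    tail = flat[-(i_end + unfreeze_level * 3):]
    return set(head) | set(tail)

layers = [f"layer{i}" for i in range(12)]
# With unfreeze_level=1: the first 6 and last 4 entries stay trainable.
kept = select_unfrozen(layers, unfreeze_level=1)
print(sorted(kept))
```

This keeps the embedding/input side and the output head trainable while the middle transformer blocks stay frozen, which is the usual cheap-transfer-learning trade-off.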
+
+ if args.local_rank == 0:
+ torch.distributed.barrier() # End of barrier to make sure only the first process in distributed training downloads the model & vocab
+
+ logger.info("Training/evaluation parameters %s", args)
+
+ # Training
+ if args.do_train:
+ if args.local_rank not in [-1, 0]:
+ torch.distributed.barrier() # Barrier to make sure only the first process in distributed training process the dataset, and the others will use the cache
+
+ train_dataset = load_and_cache_examples(args, tokenizer, evaluate=False)
+
+ if args.local_rank == 0:
+ torch.distributed.barrier()
+
+ global_step, tr_loss = train(args, train_dataset, model, tokenizer)
+ logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)
+
+ # Saving best-practices: if you use save_pretrained for the model and tokenizer, you can reload them using from_pretrained()
+ if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0):
+ save_state(args, model, tokenizer, global_step)
+
+ # Load a trained model and vocabulary that you have fine-tuned
+ model = model_class.from_pretrained(args.output_dir)
+ tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
+ model.to(args.device)
+
+ # Evaluation
+ results = {}
+ if args.do_eval and args.local_rank in [-1, 0]:
+ checkpoints = [args.output_dir]
+ if args.eval_all_checkpoints:
+ checkpoints = list(os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + '/**/' + WEIGHTS_NAME, recursive=True)))
+ logging.getLogger("transformers.modeling_utils").setLevel(logging.WARN) # Reduce logging
+ logger.info("Evaluate the following checkpoints: %s", checkpoints)
+ for checkpoint in checkpoints:
+ global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else ""
+ model = model_class.from_pretrained(checkpoint)
+ model.to(args.device)
+ result = evaluate(args, model, tokenizer, prefix=global_step)
+ result = dict((k + '_{}'.format(global_step), v) for k, v in result.items())
+ results.update(result)
+
+ return results
+
+
+if __name__ == "__main__":
+ main()
\ No newline at end of file
diff --git a/train/yt_encoder.py b/train/yt_encoder.py
new file mode 100644
index 0000000..d9bd797
--- /dev/null
+++ b/train/yt_encoder.py
@@ -0,0 +1,58 @@
+"""Byte pair encoding utilities"""
+import os
+import youtokentome as yttm
+import hashlib
+from transformers.tokenization_utils import PreTrainedTokenizer
+import shutil
+import regex as re
+from os.path import samefile
+
+NEW_LINE = '<|n|>'
+
+class YTEncoder(PreTrainedTokenizer):
+ def_name = 'encoder.model'
+ def __init__(self, filename, *inputs, **kwargs):
+ super().__init__(*inputs, **kwargs)
+
+ if os.path.isdir(filename): filename = os.path.join(filename, self.def_name)
+
+ self.bpe = yttm.BPE(filename)
+ self.hash = hashlib.sha512(open(filename, 'rb').read()).hexdigest()[:10]
+ self.filename = filename
+
+ def encode(self, text):
+ if text and text[0] != ' ': text = ' ' + text
+ text = re.sub(r'(?=[^ ])([\W])([\w])',r'\g<1> \g<2>',text)
+ text = text.replace('\n', f' {NEW_LINE} ')
+
+ return self.bpe.encode([text], output_type=yttm.OutputType.ID)[0]
+
+
+ def decode(self, tokens): # I hate regexps
+ if not isinstance(tokens,list):
+ tokens = tokens.tolist()
+ result = self.bpe.decode(tokens)[0]
+ result = re.sub(r'( )?(<\|n\|>)( )?', r'\n', result)
+ result = re.sub(r'([\n(]) (\w)',r'\g<1>\g<2>', result)
+ result = re.sub(r'(\W)([«"''\n(]|^) (\w)',r'\g<1>\g<2>\g<3>', result)
+ result = re.sub(r'(\w)- (\w)',r'\g<1>-\g<2>', result)
+ return result
+
+ def tokenize(self, text, **kwargs):
+ return self.encode(text)
+
+ @classmethod
+ def from_pretrained(cls, *inputs, **kwargs):
+ return cls(*inputs, **kwargs)
+
+ def build_inputs_with_special_tokens(self, token_ids):
+ return token_ids
+
+ def add_special_tokens_single_sentence(self, token_ids):
+ return token_ids
+
+ def save_pretrained(self, save_directory):
+ src = self.filename
+ dst = os.path.join(save_directory, self.def_name)
+ # Skip the copy when src and dst already point at the same file
+ if not os.path.exists(dst) or not samefile(src, dst):
+ shutil.copyfile(src, dst)
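`YTEncoder.encode` flattens text to one line by swapping `\n` for the `<|n|>` sentinel before BPE, and `decode` restores it afterwards. A sketch of just that pre/post-processing round trip, using the stdlib `re` module (the file itself uses the third-party `regex` package, but these patterns are compatible):

```python
import re

NEW_LINE = '<|n|>'  # same sentinel token YTEncoder uses

def to_flat(text):
    # Pre-processing step from YTEncoder.encode: replace newlines with the sentinel
    return text.replace('\n', f' {NEW_LINE} ')

def from_flat(text):
    # Post-processing step from YTEncoder.decode: restore newlines,
    # swallowing the padding spaces added around the sentinel
    return re.sub(r'( )?(<\|n\|>)( )?', '\n', text)

joke = "Knock knock.\nWho's there?"
flat = to_flat(joke)
print(flat)
print(from_flat(flat) == joke)
```

Encoding newlines as an ordinary token lets the BPE model treat line breaks as learnable vocabulary, which matters for two-liner jokes where the break carries the punchline timing.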