Merged
149 changes: 1 addition & 148 deletions lessons/lesson_07.ipynb
@@ -18,7 +18,6 @@
" - gzip\n",
" - argparse\n",
" - math\n",
" - re\n",
" - numpy\n",
" - pandas\n",
"- tidy data"
@@ -252,134 +251,6 @@
"print(\"Gamma:\", math.gamma(3)) # Gamma function at x"
]
},
{
"cell_type": "markdown",
"id": "e44e0357",
"metadata": {},
"source": [
"# Library re\n",
"- `re` stands for `regular expression`, aka `regex`\n",
"- regular expressions are a concept for text pattern matching\n",
"- `re.compile` turns the search pattern into a reusable pattern object, so repeated searches with the same pattern stay efficient"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b1616d23",
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"\n",
"# This text comes from the book \"20,000 Leagues Under the Sea\" by Jules Verne, published in 1870.\n",
"\n",
"story = \"\"\"On the 6th of November, 1867, the frigate Abraham Lincoln departed at 3:00 PM from Brooklyn pier.\n",
"The crew numbered 307 men and officers.\n",
"Captain Farragut had placed a reward of $2,000 for whoever first sighted the creature.\n",
"Professor Aronnax, a marine biologist from Paris, stood at the bow scanning the horizon.\n",
"The animal, if it exists, must be of considerable size — perhaps 200 feet in length.\n",
"The sea was calm; visibility extended roughly 15 nautical miles.\n",
"At latitude 31° 15' N, longitude 136° 42' E, they found nothing.\n",
"After 3 weeks with no sightings, the crew grew restless.\n",
"Then, on November 28th at 11:17 PM, the lookout cried: Object sighted — bearing 315 degrees!\n",
"The creature emitted a pale phosphorescent light and moved at approximately 40 knots.\n",
"Aronnax estimated its mass at no less than 1,500 tons.\n",
"Impossible, said Conseil quietly, and yet — there it is.\"\"\"\n",
"\n",
"# create pattern (raw string, so that \\d reaches the regex engine unchanged)\n",
"pattern = r\".*\\d{1,2}:\\d{2} PM.*\"\n",
"regex = re.compile(pattern)\n",
"\n",
"# search for pattern and print each line with the pattern in it\n",
"print(regex.findall(story))"
]
},
{
"cell_type": "markdown",
"id": "93a3e5a6",
"metadata": {},
"source": [
"- certain strings have specific meanings:\n",
" - `.*` = any character except `\\n`, repeated any number of times (including zero); placed before/after our pattern it captures the rest of the line\n",
" - `*` = zero or more repetitions of the preceding element\n",
" - `\\d{1,2}` = one or two digits\n",
"- when compiling the search pattern, you can include certain flags\n",
" - `re.IGNORECASE` to have case insensitive matching\n",
" - `re.DOTALL` to have the `.` match all characters incl. the line end character `\\n`\n",
" - `re.MULTILINE` to handle multiple lines in a string separately, relevant for:\n",
" - `^` = beginning of a string / line\n",
" - `$` = end of a string/line\n",
"- to combine multiple flags, use the vertical bar `|` (bitwise OR), e.g. `re.DOTALL | re.MULTILINE`"
]
},
{
"cell_type": "markdown",
"id": "5e10a102",
"metadata": {},
"source": [
"- some special characters in search pattern:\n",
"\n",
"| Character | Meaning |\n",
"| :---: | :--- |\n",
"| . | any character except new line '\\n' |\n",
"| ^ | at the beginning of a string (of each line with `re.MULTILINE`) |\n",
"| $ | at the end of a string (of each line with `re.MULTILINE`) |\n",
"| * | quantifier: 0 or more repetitions |\n",
"| + | quantifier: 1 or more repetitions |\n",
"| ? | quantifier: 0 or 1 repetition |\n",
"| {m} | exactly m repetitions, e.g. {3} |\n",
"| {m,n} | between m and n repetitions, e.g. {2,4}; also {,4} or {4,} for half-open ranges |\n",
"| [ ] | character set to choose from, e.g. [ACGT]; most special characters become literal inside the set, e.g. [ab*] |\n",
"| [a-z] | a single lower case letter |\n",
"| [0-9] | a single digit |\n",
"| \\ | escape character, e.g. \\* is a literal asterisk and not a quantifier |\n",
"| \\| | alternation: matches the pattern on either side (logical or) |\n",
" "
]
},
{
"cell_type": "markdown",
"id": "8964ea64",
"metadata": {},
"source": [
"- several methods are available on a compiled pattern object\n",
"- below is an overview of the search methods and their results\n",
"- all are called on a pattern compiled via `re.compile(<string>)` and take the string to search in; flags are passed to `re.compile`, not to these methods\n",
"\n",
"\n",
"| Method | Description |\n",
"| :--- | :--- |\n",
"| `pattern.search(string)` | match object for the first match anywhere in the string, otherwise `None` |\n",
"| `pattern.match(string)` | match object, but tests only the beginning of the string |\n",
"| `pattern.fullmatch(string)` | match object only if the whole string matches, otherwise `None` |\n",
"| `pattern.findall(string)` | list of all matched strings (or of groups, if the pattern contains groups) |\n",
"| `pattern.finditer(string)` | iterator over match objects, one for each match that `.findall()` would list |\n",
"| `pattern.split(string, maxsplit=0)` | splits the string at each occurrence of the pattern; `maxsplit=0` means no limit |\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "423d4b35",
"metadata": {},
"outputs": [],
"source": [
"sequence = \"\"\"ATGGATAAGAAATACTCAATAGGCTTAGATATCGGCACAAATAGCGTCGGATGGGCGGTGATCACTGATG\n",
"AATATAAGGTTCCGTCTAAAAAGTTCAAGGTTCTGGGAAATACAGACCGCCACAGTATCAAAAAAAATCT\n",
"TATAGGGGCTCTTTTATTTGACAGTGGAGAGACAGCGGAAGCGACTCGTCTCAAACGGACAGCTCGTAGA\"\"\"\n",
"pattern = re.compile(\n",
" \"AT[ACT]GG[ACGT]\"\n",
") # represents AA sequence 'IG' = Isoleucine + Glycine\n",
"\n",
"print(\"first match\", pattern.search(sequence))\n",
"print(\"match beginning\", pattern.match(sequence))\n",
"print(\"whole string match\", pattern.fullmatch(sequence))\n",
"print(\"list of matches\", pattern.findall(sequence))\n",
"print(\"iterator for matches\", list(pattern.finditer(sequence))) # list() shows the match objects\n",
"print(\"split at matches\", pattern.split(sequence))"
]
},
{
"cell_type": "markdown",
"id": "fad61f15",
@@ -604,7 +475,7 @@
"metadata": {},
"source": [
"## Why to use numpy?\n",
"- numpy (and scipy) are fast, really fast\n",
"- numpy (also scipy) are fast, really fast\n",
"- for demonstration purposes, we will create 10K random numbers and add them together. We will repeat the step for the addition several times and test the performance with a (Jupyter) built-in function `%timeit`\n",
"- we will compare numpy with a for loop"
]
@@ -1127,24 +998,6 @@
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "1aa2b2d8",
"metadata": {},
"source": [
"- look for any stop codons in your sequence using regular expressions\n",
"- stop codons: UAA, UAG, UGA\n",
"- print to screen the positions for each stop codon"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c56a8cf0",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "8c660920",
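For reference, the behaviour described in the removed `re` cells of lesson_07 can be reproduced outside the notebook. The following is a minimal sketch, not the notebook's own code: the short `sequence` and the three-line `log` string are hypothetical inputs made up for illustration, and only standard `re` calls are used.

```python
import re

# Hypothetical short DNA sequence, chosen only for illustration.
sequence = "ATAGGCTTATCGGA"

# Same codon pattern as in the removed cell: Ile (AT[ACT]) followed by Gly (GG[ACGT]).
pattern = re.compile(r"AT[ACT]GG[ACGT]")

first = pattern.search(sequence)      # match object for the first hit, or None
at_start = pattern.match(sequence)    # only succeeds at the start of the string
whole = pattern.fullmatch(sequence)   # None unless the entire string matches
hits = pattern.findall(sequence)      # list of matched substrings
spans = [m.span() for m in pattern.finditer(sequence)]  # (start, end) per match
pieces = pattern.split(sequence)      # text between the matches

print(first.group(), at_start.span(), whole)
print(hits, spans, pieces)

# Flags belong to re.compile(); combine them with the bitwise OR operator.
# Hypothetical three-line text, loosely modelled on the story in the removed cell.
log = "departed at 3:00 PM\nnothing sighted\nlookout cried at 11:17 pm"
time_line = re.compile(r"^.*\d{1,2}:\d{2} PM.*$", re.MULTILINE | re.IGNORECASE)
lines_with_time = time_line.findall(log)
print(lines_with_time)
```

The raw-string prefix `r"..."` keeps `\d` from being interpreted as a Python string escape, and passing the flags at compile time is what lets `^` and `$` anchor to individual lines.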
1 change: 0 additions & 1 deletion lessons/lesson_08.ipynb
@@ -19,7 +19,6 @@
" - gzip\n",
" - argparse\n",
" - math\n",
" - re\n",
" - numpy\n",
" - pandas\n",
"- tidy data"
29 changes: 0 additions & 29 deletions solutions/solutions_07.ipynb
@@ -35,35 +35,6 @@
"random_sequence = \"\".join(random.choices(list(\"ACGU\"), k=length))"
]
},
{
"cell_type": "markdown",
"id": "1aa2b2d8",
"metadata": {},
"source": [
"- look for any stop codons in your sequence using regular expressions\n",
"- stop codons: UAA, UAG, UGA\n",
"- print to screen the positions for each stop codon"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c56a8cf0",
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"\n",
"pattern = re.compile(\"UAA|UAG|UGA\")\n",
"\n",
"positions = []\n",
"\n",
"# call finditer on the compiled pattern; each match object knows its own span\n",
"for p in pattern.finditer(random_sequence):\n",
" positions.append(p.span())\n",
"\n",
"print(positions)"
]
},
{
"cell_type": "markdown",
"id": "8c660920",
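The removed solution depends on `random_sequence`, so its output changes on every run. As a reproducible sketch of the same idea, the version below uses a fixed, hypothetical RNA string (`rna` is an assumption for illustration, not part of the exercise) and also records which stop codon was found at each position.

```python
import re

# Hypothetical RNA sequence with a known layout, used instead of a random one.
rna = "AUGUAAGGCUGACCUAG"

# The three stop codons from the exercise, joined by alternation.
stop_codon = re.compile(r"UAA|UAG|UGA")

# finditer yields non-overlapping matches from left to right.
positions = [(m.start(), m.group()) for m in stop_codon.finditer(rna)]
print(positions)

# Optional refinement (an assumption, not asked for in the exercise):
# keep only stop codons that sit in the reading frame starting at index 0.
in_frame = [start for start, _ in positions if start % 3 == 0]
print(in_frame)
```

Note that a plain regex search ignores the reading frame, which is why the frame filter is shown as a separate step.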