!SLIDE
!SLIDE
- Search engineer at CV-Library (cv-library.co.uk)
- I am working with a lot of text data using Python to process it
Maciej Dziardziel (fiedzia@gmail.com)
!SLIDE
"A domain-specific language (DSL) is a computer language specialized to a particular application domain." (wikipedia)
!SLIDE left
- HTML
- CSS
- SQL
- Regular expressions
- Django ORM
- Programming languages
(programming languages are out of scope of this presentation, but you can use the tools we discuss today to design and process them)
!SLIDE left
- Provide easy to understand/write/process way of encoding domain-specific-knowledge
- You work with people not familiar with programming languages
- programming languge you are using is not suitable to work with given domain
- You want to process existing code for some DSL in a way no existing tool supports
!SLIDE left
- There are many job types, every type has its own terminology and concepts
- We needed a way to write down this knowledge to build set of job categories and assign them to jobs
- People who have this knowledge are not developers, so they needed simple tools they can work with easily
- We decided to create set of rules, where each rule may define conditions and actions
- rule may contain arbitrarily complex conditions
!SLIDE
Conditions:
title "python" and ("developer" or "programmer")
Actions:
add category "it/developers/python"
!SLIDE left
- parse this text into form usable in python
- ensure rules are valid according to definition of our DSL
- compile it to Python bytecode for efficient execution
!SLIDE
if "python" in job.title and ("developer" in job.title or "programmer" in job.title):
job.categories.add("it/developers/python")
!SLIDE left
- It just works
- Full power of Python
- Good error messages
- Full power of Python...
- Error messages understandable ... if you understan python
- Users (who are not developers) will need to understand enormous amount of Python concepts
- Code is tied to single implementation, its variables and datastructures
- Code is hard to analyze and modify
- Code is hard to optimize
- Its very easy to make all kinds of mistakes
- Running arbitrary code is dangerous
!SLIDE
#cat code1_python_function.py
def is_it_python_job(job):
if "python" in job.title and ("developer" in job.title or "programmer" in job.title):
job.labels.add("it/developers/python")
!SLIDE
import code1_python_function
code1_python_function.is_it_python_job
>>> <function code1_python_function.is_it_python_job>
type(code1_python_function.is_it_python_job.__code__)
>>> code
import dis
dis.dis(code1_python_function.is_it_python_job)
>>> 0 LOAD_CONST 1 ('python')
>>> 3 LOAD_FAST 0 (job)
>>> 6 LOAD_ATTR 0 (title)
>>> 9 COMPARE_OP 6 (in)
>>> ...
!SLIDE
code1_python_function.is_it_python_job.__code__.co_code
>> b'd\x01\x00|\x00\x00j\x00\x00k\x06\x00r@\x00d\x02\x00|\x00\x00j\x00\x00k\x06\x00s-\x00d\x03\x00|\x00\x00j\x00\x00k\x06\x00r@\x00|\x00\x00j\x01\x00j\x02\x00d\x04\x00\x83\x01\x00\x01n\x00\x00d\x00\x00S'
!SLIDE
Help on function parse in module ast:
parse(source, filename='<unknown>', mode='exec')
Parse the source into an AST node.
Equivalent to compile(source, filename, mode, PyCF_ONLY_AST).
!SLIDE
import ast
expr = ast.parse("2+2", "", "eval")
ast.dump(expr)
>>> 'Expression(body=BinOp(left=Num(n=2), op=Add(), right=Num(n=2)))'
list(ast.walk(expr))
>>> [<_ast.Expression at 0x7efef1c30630>,
>>> <_ast.BinOp at 0x7efef1c306a0>,
>>> <_ast.Num at 0x7efef1c30a90>,
>>> <_ast.Add at 0x7efef4f03da0>,
>>> <_ast.Num at 0x7efef1c306d8>]
expr = ast.parse('"python" and ("developer" or "programmer")', "", "eval")
ast.dump(expr)
>>> "Expression(body=BoolOp(op=And(), values=[Str(s='python'), BoolOp(op=Or(), values=[Str(s='developer'), Str(s='programmer')])]))"
!SLIDE
import ast
ALLOWED_AST_NODES = (ast.Module,ast.Expression, ast.And, ast.Or, ast.Expr, ast.Str, ast.BoolOp, ast.UnaryOp, ast.Not)
def check_expr(expr_str):
parsed = ast.parse(expr_str, '<rule>', 'eval')
for node in ast.walk(parsed):
if type(node) not in ALLOWED_AST_NODES:
raise Exception('node type not allowed: {node_type}'.format(node_type=type(node)))
check_expr('"python" and "developer"')
>>>
check_expr('print(3)')
>>> Exception: node type not allowed: <class '_ast.Call'>
!SLIDE
class ASTConvertStrToCall(ast.NodeTransformer):
"""Transform 's' into contains('s')"""
def visit_Str(self, node):
return ast.copy_location(
ast.Call(
func=ast.Name(id='contains', ctx=ast.Load()),
args=[ast.Str(node.s)],
keywords=[]
),
node
)
parsed = ast.parse(expr_str, '<rule>', 'eval')
tree = ASTConvertStrToCall().visit(parsed)
ast.fix_missing_locations(tree)
compiled = compile(tree, "<rule>", "eval")
namespace = {}
return eval(compiled, namespace)
!SLIDE
Generate python function that takes a string and return true if it matches our rules, false otherwise
code2_python_full_example.py
!SLIDE
- We can reduce risk of running arbitrary code (but getting it right may be hard)
- We can use simple syntax and transform AST nodes into desired form
- We are still dealing with python code
- Python AST is too complex or too low-level for many needs
- We are limited by python syntax
!SLIDE left
What we want:
- Arbitrary syntax
- Arbitrary AST Node types
!SLIDE left
- Language to define our DSL (usually its own DSL)
- Parser for our DSL
- Helper tools (debugging, visualisation)
!SLIDE
https://wiki.python.org/moin/LanguageParsing
!SLIDE left
http://igordejanovic.net/Arpeggio/
- Works with python3
- Well maintened
- Well documented
- Easy to use
- Good debugging tools
!SLIDE
number = r'\d*\.\d*|\d+'
factor = ("+" / "-")?
(number / "(" expression ")")
term = factor (( "*" / "/") factor)*
expression = term (("+" / "-") term)*
calc = expression+ EOF
!SLIDE
!SLIDE
code3_arpeggio_calc.py
!SLIDE
"python" and ("developer" or "programmer")
phrase = r'"[a-z]*"'
negation = r'not'
factor = (phrase / "(" expression ")")
neg_factor = negation factor
term = (factor/neg_factor) ( "and" (factor / neg_factor))*
neg_term = negation term
expression = (term / neg_term) ("or" (term / neg_term))*
neg_expression = negation expression
query = (expression/ neg_expression)+ EOF
!SLIDE
See code4_arpeggio_query.py
!SLIDE left
- we know how to define grammar of our DSL
- we can create a parser that will build AST for us (failing for invalid input)
- We can inspect and transform AST to build new one or to extract desired information
!SLIDE
!SLIDE
