M42PL is a Data Manipulation Language, inspired by Unix shells and Splunk.
The language is extremely simple to learn and to use. It is designed to make data manipulation trivial, even for non-technical users.
- M42PL syntax is trivial:
- A program / script is a list of commands
- A command starts with the pipe
|
character - A command may takes argument(s) (also known as fields)
- Commands may implements their own custom grammar (like the
stats
command), which means that only the|
(pipe) and[]
(sub-pipeline) structures are part of the core language - M42PL runs on a Python interpreter, but its execution is managed by dispatchers which controls how and where the code is ran: on the local interpreter, on multiple processes, on Celery nodes, etc.
- Most fields (commands arguments) in M42PL behaves like lambdas; To refers to
the field
user.name
, you can write:data=user.name
(direct access)data={user.name}
(in-line JsonPath expression)data=`field(user.name)`
(in-line eval statement)data=[ | curl '...' | fields response.json.name ]
(in-line sub-pipeline)
M42PL can run scripts or can be run in REPL mode. M42PL scripts are
standard text files, which end with .mpl
or .m42pl
by convention.
To start an interpreter (REPL), run the command m42pl repl
(type exit
to leave):
$ m42pl repl
m42pl |
Mandatory hello world script:
| make | eval hello = 'world !'
You may run a M42PL script using the m42pl run
command:
$ m42pl run <filename.mpl>
A M42PL script is a pipeline (a list of commands starting with
pipes |
):
| make | eval foo = 'bar' | output
You can separate commands with new lines too (new lines are ignored):
| make
| eval foo = 'bar'
| output
Most commands takes parameters (aka. fields):
| make 2 showinfo=yes
| output
- positional parameters have no name (ex:
2
) - named parameters are prefixed with their name (ex:
showinfo=yes
)
Commands parameters (aka. fields) support various syntax:
Example | Field | Description |
---|---|---|
| make count=2 |
2 |
Nunber |
| output format='json' |
'json' |
String |
| make showinfo=`True` |
`True` |
Eval expression |
| fields response.items |
response.items |
Field path variable |
| fields {response.items[0]} |
{response.items[0]} |
JSON path variable |
| wget url=[| read url.txt] |
[| read url.txt] |
Sub-pipeline |
A comment is a call to the | ignore
or | comment
command:
| make
| ignore eval foo='bar'
| output format=json
Commands have multiple names (aka. aliases); The following snippets are identical:
| make | eval foo='bar' | output
| makeevent | evaluate foo='bar' | print
Five types of commands exists:
- Generating: One per pipeline; Generate events (e.g. performs an HTTP request, read a file, consume a queue, etc.)
- Streaming: Process events as they arrive. As many as needed per pipeline.
- Buffering: Process events batches. As many as needed per pipeline.
- Meta: Control the pipeline behaviour and parameters. As many as needed per pipeline.
- Merging: Indicates that a split pipeline must be merged.
Having a single generating command per pipeline may looks limitating, but M42PL supports sub-pipelines:
| readfile 'list_of_urls.txt'
| foreach [
| wget url
| fields response.content
]
Commands may also implements their own, custom grammar:
| make count=10 showinfo=yes
| rename id as event.id
| eval is_even = even(event.id)
| stats count by is_even
M42PL is build on four components:
Component | Description | Link |
---|---|---|
m42pl_core |
Languages core (base classes, utils, etc.) | GitHub |
m42pl_commands |
Core language commands | GitHub |
m42pl_dispatchers |
Executes M42PL scripts and REPL | GitHub |
m42pl_kvstores |
Key/values stores support | GitHub |
m42pl_encoders |
Encode and decode data formats | GitHub |
You may find extra components packages such as the lab commands.
Implements the base classes and the language utilities.
Implements most of M42PL functionnalities.
- Repository link
- [Documentation][m42pl-docs-commands]
Implements M42PL execution method (local, multi-processing, Celery, etc.).
Implements M42PL key/value stores.
Implements data format casting encoding and decoding (e.g. cast to
msgpack
, JSON
, bson
, etc.).
Create and activate a virtual environement:
python3 -m virtualenv m42pl
source m42pl/bin/activate
Install the core language m42pl_core
:
git clone https://github.com/jpclipffel/m42pl-core
pip install m42pl-core
Install the core commands m42pl_commands
:
git clone https://github.com/jpclipffel/m42pl-commands
pip install m42pl-commands
Install the core dispatchers m42pl_dispatchers
:
git clone https://github.com/jpclipffel/m42pl-dispatchers
pip install m42pl-dispatchers
Install the core kvstores m42pl_kvsotres
:
git clone https://github.com/jpclipffel/m42pl-kvstores
pip install m42pl-kvstores
Install the core encoders m42pl_encoders
:
git clone https://github.com/jpclipffel/m42pl-encoders
pip install m42pl-encoders
No; I like to call it a data processing language because it is not designed to be a competitor to programming languages such as Python, Haskell, C++, etc.
Maybe it can be useful for people who program data flows (e.g. for prototyping), or for people who do not program and just want a quick and easy way to collect and process data.
I used to work with Splunk and Elastic Search a lot in a previous job. I loved both products, but I always had the feeling that Elastic Search's DSL syntax was too tedious and Splunk's SPL was too limitating.
I initially wanted to write a tool to query Elastic Search with the same language as Splunk, but I quickly drifted from my original goal.