Rediscover the known Universe with NASA datasets

A presentation by Horacio Gonzalez at Codemotion Rome, March 2019, in Rome, Metropolitan City of Rome, Italy

Slide 1

Rome | March 22 - 23, 2019 Rediscover the known Universe with NASA datasets Horacio Gonzalez @LostInBrittany

Slide 2

Horacio Gonzalez @LostInBrittany Spaniard lost in Brittany, developer, dreamer and all-around geek

Slide 3

HelloExoWorld Looking for exoplanets in NASA datasets

Slide 4

HelloExoWorld Once upon a time…

Slide 5

An amateur astronomer Pierre Zemb, DevOps OVH

Slide 6

What not to do if you love astronomy Live in Brest

Slide 7

Looking for solutions Computer stuff Astronomy Mixing passions

Slide 8

Google is your friend… Let’s find a project

Slide 9

Exoplanets? Planets orbiting stars far away

Slide 10

How do we find them? The transit method seems the best

Slide 11

The transit method Credits: NASA’s Goddard Space Flight Center
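
The slide shows the transit method as an animation: a planet passing in front of its star blocks a little of the star's light. As a rough illustration only, the relative dip can be estimated as the ratio of the two disc areas. The function name and the rounded radii below are assumptions for this sketch, not part of the talk.

```python
# Approximate mean radii in kilometres (rounded reference values).
R_SUN_KM = 695_700.0
R_JUPITER_KM = 69_911.0
R_EARTH_KM = 6_371.0


def transit_depth(planet_radius_km: float, star_radius_km: float) -> float:
    """Fraction of the star's light blocked while the planet crosses its disc.

    Simplifying assumptions: uniformly bright stellar disc, planet entirely
    in front of the star, no light from the planet itself.
    """
    return (planet_radius_km / star_radius_km) ** 2


if __name__ == "__main__":
    for name, radius in (("Jupiter", R_JUPITER_KM), ("Earth", R_EARTH_KM)):
        depth = transit_depth(radius, R_SUN_KM)
        print(f"{name} in front of the Sun: {depth * 1e6:.0f} ppm")
```

Under these assumptions a Jupiter-sized planet dims a Sun-like star by about 1%, and an Earth-sized one by less than 0.01%.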

Slide 12

How do we look for transits? Image credits: NASA Kepler

Slide 13

Watching the sky By Carter Roberts [Public domain], via Wikimedia Commons

Slide 14

And what kind of data do we get? Pleiades By NASA, ESA, AURA/Caltech, Palomar Observatory, via Wikimedia Commons

Slide 15

Well, that’s the problem Seven stars, seven different profiles

Slide 16

Kinda big data Over 40 million light curves

Slide 17

Big AND open data Lots of datasets in #opendata

Slide 18

And we can help with that! Let’s use our tools to analyse the data

Slide 19

Time Series To analyse Kepler datasets

Slide 20

Kepler: spatial Time Series Definition of Time Series: A series of data points indexed in time order

Slide 21

Time Series
● Stock Market Analysis
● Economic Forecasting
● Budgetary Analysis
● Process and Quality Control
● Workload Projections
● Census Analysis
● …

Slide 22

Time Series Applications:
● Understanding the data
● Fit a model
  ○ Monitoring
  ○ Forecasting

Slide 23

Time Series Stock market Analytics Economic Forecasting $$$ Study & Research

Slide 24

Time Series Many specific analytical tools:
● Moving average
● ARMA (AutoRegressive Moving Average)
● Multivariate ARMA models
● ARCH (AutoRegressive Conditional Heteroscedasticity)
● Dynamic time warping
● …

Slide 25

Time Series Specific application of general tools
● Artificial neural networks
● Hidden Markov model
● Fourier & wavelet transforms
● Entropy encoding
● …

Slide 26

Dealing with Time Series The 3 ‘v’

Slide 27

A match made in heaven Warp 10, OVH Observability and HelloExoWorld

Slide 28

Monitoring OVH with Time Series

Slide 29

OVH Observability Data Platform Some of OVH Observability metrics:
● 1.5M datapoints/s, 24/7
● Peaks at ~10M datapoints/s
● 500M unique series

Slide 30

Tools to deal with Time Series Many options

Slide 31

Metrics Data Platform

Slide 32

Metrics Data Platform

Slide 33

Why Warp 10? Warp 10 is a software platform that
● Ingests and stores time series
● Manipulates and analyzes time series

Slide 34

Analytics is the key to success Fetching data is only the tip of the iceberg

Slide 35

Manipulating Time Series with Warp 10 A true Time Series analysis toolbox
○ Hundreds of functions
○ Manipulation frameworks
○ Analysis workflow

Slide 36

What we have done
● Downloaded and parsed 40 million FITS files
● Pushed them to OVH Metrics
● Selected a cool subset as a training set
● Verified we could find the same planets as NASA

Slide 37

Choosing a star: Kepler 11 Image credit: NASA/Tim Pyle

Slide 38

Looking at the raw signal… SAP_FLUX: The flux in units of electrons per second contained in the optimal aperture pixels collected by the spacecraft.

Slide 39

Looking at the raw signal… ? SAP_FLUX: The flux in units of electrons per second contained in the optimal aperture pixels collected by the spacecraft.

Slide 40

Looking at one record Perturbations in dirty signals

Slide 41

Transits are tiny ~40 electrons per second

Slide 42

First step: downsampling

Slide 43

First step: downsampling You can see the transit candidates… but how can we teach the computer to see them?

Slide 44

If you ♥ signal processing High pass filter

Slide 45

Poor person’s high pass filter Using the trend

Slide 46

Signal - Trend Now you can see them well
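
Why subtracting the trend helps can be shown on made-up data. This is a minimal Python sketch, not the WarpScript used in the talk: the trend is assumed to be a centered moving average, and the signal, window size and helper names are all hypothetical.

```python
def moving_average(values, half_window):
    """Centered moving average; the window is truncated at both ends."""
    n = len(values)
    trend = []
    for i in range(n):
        window = values[max(0, i - half_window):min(n, i + half_window + 1)]
        trend.append(sum(window) / len(window))
    return trend


def detrend(values, half_window):
    """Subtract the slow trend so that only fast variations remain."""
    trend = moving_average(values, half_window)
    return [v - t for v, t in zip(values, trend)]


def index_of_minimum(values):
    """Position of the lowest sample."""
    return min(range(len(values)), key=values.__getitem__)


# Synthetic light curve (made-up numbers): a slow upward drift plus a
# short 40-unit dip standing in for a transit.
signal = [1000.0 + 0.5 * i for i in range(200)]
for i in range(100, 104):
    signal[i] -= 40.0

residual = detrend(signal, half_window=25)

if __name__ == "__main__":
    print("lowest raw sample at index", index_of_minimum(signal))
    print("lowest detrended sample at index", index_of_minimum(residual))
```

In the raw series the drift hides the dip, so the lowest sample is simply the first one. After subtracting the trend, the lowest sample falls inside the dip.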

Slide 47

After some tuning We have our transit candidates

Slide 48

What’s next? Where do we go from here?

Slide 49

Only the beginning Better detection New import method Explorer Deep learning satellite/star location Yours?

Slide 50

A growing team

Slide 51

And you! Join us! https://helloexo.world https://xkcd.com/1371/

Slide 52

Thank you!

Slide 53

Want to know more? Analysing with WarpScript

Slide 54

WarpScript Reverse Polish Notation
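
For readers new to Reverse Polish Notation: operands are pushed on a stack and each operator consumes the values on top of it. The toy evaluator below is a Python illustration of that idea only. It is not WarpScript, and the function name is made up for this sketch.

```python
def eval_rpn(tokens):
    """Evaluate a list of RPN tokens and return the final stack."""
    operators = {
        "+": lambda a, b: a + b,
        "-": lambda a, b: a - b,
        "*": lambda a, b: a * b,
        "/": lambda a, b: a / b,
    }
    stack = []
    for token in tokens:
        if token in operators:
            # The operator pops its two operands: the top of the stack
            # is the right-hand one.
            right = stack.pop()
            left = stack.pop()
            stack.append(operators[token](left, right))
        else:
            # Anything else is treated as a number and pushed.
            stack.append(float(token))
    return stack


if __name__ == "__main__":
    # "3 4 + 2 *" means (3 + 4) * 2
    print(eval_rpn("3 4 + 2 *".split()))
```

The order of the tokens replaces parentheses: `3 4 + 2 *` adds first, then multiplies.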

Slide 55

Variables
  'hello, world!'  // Push the 'hello, world!' string onto the stack
  'exo' STORE      // Store it in a variable called exo
  $exo             // Then push the exo variable back onto the stack

Slide 56

What are the available series?
  [
    $readToken  // Application authentication
    '~.*'       // Selector for classname
    {}          // Selector for labels
  ] FIND

Slide 57

Get raw data
  [
    $readToken                     // Application authentication
    'sap.flux'                     // Selector for classname
    { 'KEPLERID' '6541920' }       // Selector for labels
    '2009-05-02T00:56:10.000000Z'  // Start date
    '2013-05-11T12:02:06.000000Z'  // End date
  ] FETCH

Slide 58

Kepler-11: Raw data

Slide 59

Time manipulation

Slide 60

Time related functions

Slide 61

How to split a Time series
  $gts      // Singleton (or list of) GTS
  6 h       // Minimum time without data-points
  100       // Minimum number of data-points required
  'record'  // New label to subdivide the result
  TIMESPLIT
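
The three parameters on this slide (quiet period, minimum number of points, new label) can be illustrated outside Warp 10. The Python sketch below mimics the idea on a plain list of (timestamp, value) pairs. The function name, the "gap greater than or equal to the quiet period" rule and the numbering of pieces from 1 are assumptions of this sketch, not a description of how TIMESPLIT is implemented.

```python
def time_split(points, quiet_period, min_points):
    """Split time-sorted (timestamp, value) pairs into records.

    A new record starts whenever the gap to the previous point is at
    least quiet_period. Records with fewer than min_points points are
    dropped. The result maps a record number (as a string, like a label
    value) to its points.
    """
    chunks = []
    current = []
    for timestamp, value in points:
        if current and timestamp - current[-1][0] >= quiet_period:
            chunks.append(current)
            current = []
        current.append((timestamp, value))
    if current:
        chunks.append(current)
    kept = [chunk for chunk in chunks if len(chunk) >= min_points]
    # Assumption: surviving records are numbered from 1.
    return {str(number): chunk for number, chunk in enumerate(kept, start=1)}


if __name__ == "__main__":
    points = [(0, 1.0), (1, 1.0), (2, 1.0), (10, 1.0), (11, 1.0), (30, 1.0)]
    for record, chunk in time_split(points, quiet_period=5, min_points=2).items():
        print("record", record, "->", chunk)
```

With a quiet period of 5 and a minimum of 2 points, the lone point at timestamp 30 is discarded and two records remain.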

Slide 62

Filtering
  [
    $gts              // Singleton (or list of) GTS
    []                // Equivalence classes
    { 'record' '5' }  // Labels to select
    filter.bylabels   // Type of filter
  ] FILTER
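
Selecting series by label boils down to keeping the series whose labels match a given map. A minimal Python sketch of that idea follows. The dictionary layout of a series and the exact-match rule are assumptions made for illustration, and the sketch leaves out the equivalence-classes parameter.

```python
def filter_by_labels(series_list, wanted_labels):
    """Keep the series whose labels contain every wanted name/value pair."""
    return [
        series
        for series in series_list
        if all(
            series["labels"].get(name) == value
            for name, value in wanted_labels.items()
        )
    ]


if __name__ == "__main__":
    # Hypothetical records produced by an earlier split.
    records = [
        {"labels": {"KEPLERID": "6541920", "record": "4"}, "points": []},
        {"labels": {"KEPLERID": "6541920", "record": "5"}, "points": []},
        {"labels": {"KEPLERID": "6541920", "record": "6"}, "points": []},
    ]
    print(filter_by_labels(records, {"record": "5"}))
```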

Slide 63

Reference record: 5

Slide 64

Downsampling

Slide 65

Bucketize

Slide 66

Syntax: Time series parameter
  [ $gts bucketizer.min 0 2 h 0 ] BUCKETIZE
  $gts: a singleton or a time-series set

Slide 67

Syntax: Bucketizer
  [ $gts bucketizer.min 0 2 h 0 ] BUCKETIZE
  bucketizer.min: type of operator to apply on each bucket (last, max, mean, and, count…)

Slide 68

Syntax: Lastbucket
  [ $gts bucketizer.min 0 2 h 0 ] BUCKETIZE
  Lastbucket (first 0): end timestamp of the most recent bucket

Slide 69

Syntax: Bucketspan
  [ $gts bucketizer.min 0 2 h 0 ] BUCKETIZE
  Bucketspan (2 h): width of a bucket

Slide 70

Syntax: Bucketcount
  [ $gts bucketizer.min 0 2 h 0 ] BUCKETIZE
  Bucketcount (last 0): number of buckets to keep
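
Bucketizing means cutting the time axis into fixed-width buckets and reducing each bucket to one value. The Python sketch below illustrates this with a pluggable reducer. The function name, the alignment of buckets on multiples of the span and the choice to tag each bucket with its end are simplifying assumptions. The lastbucket and bucketcount parameters are not modeled.

```python
def bucketize(points, bucketspan, reducer=min):
    """Downsample (timestamp, value) pairs to one value per bucket.

    Assumption: a bucket covers [start, start + bucketspan) with start a
    multiple of bucketspan, and is reported under its end timestamp.
    Empty buckets are simply absent from the result.
    """
    buckets = {}
    for timestamp, value in points:
        start = (timestamp // bucketspan) * bucketspan
        buckets.setdefault(start + bucketspan, []).append(value)
    return [(end, reducer(values)) for end, values in sorted(buckets.items())]


if __name__ == "__main__":
    points = [(0, 5.0), (1, 3.0), (2, 4.0), (3, 9.0)]
    print(bucketize(points, bucketspan=2))               # minimum per bucket
    print(bucketize(points, bucketspan=2, reducer=max))  # maximum per bucket
```

Swapping the reducer plays the role of choosing another bucketizer (last, max, mean and so on).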

Slide 71

Actual

Slide 72

Trend

Slide 73

Mapper

Slide 74

Syntax: Time series parameter
  [ $gts mapper.mean 2 2 0 ] MAP
  $gts: a singleton or a time-series set

Slide 75

Syntax: Mapper
  [ $gts mapper.mean 2 2 0 ] MAP
  mapper.mean: type of operator to apply on each window (add, gt, rate, and, count…)

Slide 76

Syntax: Pre
  [ $gts mapper.mean 2 2 0 ] MAP
  Pre (first 2): number of data-points before

Slide 77

Syntax: Post
  [ $gts mapper.mean 2 2 0 ] MAP
  Post (second 2): number of data-points after

Slide 78

Syntax: Occurrence
  [ $gts mapper.mean 2 2 0 ] MAP
  Occurrence (0): maximum number of calculations for a data-point
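
The pre and post parameters define a sliding window around each data-point, and the mapper reduces that window to one value. A Python sketch of the window mechanics follows. The function name and the truncation of windows at the edges are assumptions, and the occurrence parameter is ignored: every point is computed.

```python
from statistics import mean


def map_window(values, pre, post, operator):
    """Apply operator to a sliding window around each value.

    The window for position i holds up to `pre` values before it, the
    value itself, and up to `post` values after it. Near the edges the
    window is simply shorter (an assumption of this sketch).
    """
    n = len(values)
    result = []
    for i in range(n):
        window = values[max(0, i - pre):min(n, i + post + 1)]
        result.append(operator(window))
    return result


if __name__ == "__main__":
    values = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(map_window(values, 2, 2, mean))  # smoothed series
    print(map_window(values, 2, 2, max))   # running maximum
```

With pre and post both set to 2, each output is computed from at most five neighbouring points, which is what turns a noisy series into a trend.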

Slide 79

Actual

Slide 80

Trend

Slide 81

Actual - trend

Slide 82

Actual - trend

Slide 83

Time to level-up!

Slide 84

Time series operation
  [
    $gts0         // First series pull
    …             // …
    $gtsN         // N series pull
    [ 'record' ]  // Key labels list
    op.add        // Type of operator
  ] APPLY

Slide 85

Syntax: Time series parameter
  [ $gts0 … $gtsN [ 'record' ] op.add ] APPLY
  $gts0 … $gtsN: singletons or time-series sets

Slide 86

Syntax: Equivalence class
  [ $gts0 … $gtsN [ 'record' ] op.add ] APPLY
  [ 'record' ]: records data grouped into equivalence classes (Record 1, Record 2, Record 3)

Slide 87

Syntax: Operator
  [ $gts0 … $gtsN [ 'record' ] op.add ] APPLY
  op.add: type of operator to apply on each class (Record 1, Record 2, Record 3): sub, gt, mask, and, mul…
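
The equivalence-class idea can be sketched in Python: series sharing the same values for the key labels are grouped, and the operator is applied within each group, timestamp by timestamp. The data layout, the function name and the rule of keeping only timestamps present in every series of a class are assumptions of this sketch, not a statement of how APPLY behaves.

```python
from collections import defaultdict


def apply_add(series_list, key_labels):
    """Group series by the given labels and add them point by point.

    Each series is a dict with "labels" (name -> value) and "points"
    (timestamp -> value). Returns a dict mapping each class key (a tuple
    of label values) to the summed points.
    """
    classes = defaultdict(list)
    for series in series_list:
        key = tuple(series["labels"].get(name) for name in key_labels)
        classes[key].append(series)

    result = {}
    for key, members in classes.items():
        # Assumption: only timestamps shared by all members are combined.
        common = set(members[0]["points"])
        for member in members[1:]:
            common &= set(member["points"])
        result[key] = {
            t: sum(member["points"][t] for member in members)
            for t in sorted(common)
        }
    return result


if __name__ == "__main__":
    series = [
        {"labels": {"record": "1"}, "points": {0: 1.0, 1: 2.0}},
        {"labels": {"record": "1"}, "points": {0: 10.0, 1: 20.0, 2: 30.0}},
        {"labels": {"record": "2"}, "points": {0: 5.0}},
    ]
    print(apply_add(series, ["record"]))
```

Replacing the sum with a subtraction gives the "actual minus trend" operation, applied record by record.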

Slide 88

Final result