
Enabling data analysis by building a Big Data platform
@ Orange
Key achievements
- Took part in every step of building an industrial Hadoop Big Data platform
- Used the newly built platform to implement two projects
- Helped grow the Data Engineering team from two to more than ten people
Client
Orange is the first- or second-largest telecom company in most European countries, as well as in Africa, with a total of 266M clients.
I worked for Orange's French B2B data service for two years (2016-2018):
- the first year as a data engineer, building a Big Data platform;
- the second as a data scientist, building three Data Science projects based on the newly available data.
Problem
In 2018, Orange's B2B data was distributed among many applications. This made cross-referencing data slow, complex, and hard to standardize. The existing BI architecture and tools also didn't allow the storage and analysis of massive, less structured data.
At its creation, I joined the team in charge of building a Hadoop Big Data platform, giving access to all of the French B2B data in a single place, in a simple way, and with modern tools to analyze it.
Process
I arrived at the very beginning of the project and took part in every step: defining the architecture of the Hadoop platform, building a framework, implementing each needed module, and expanding the team.
Here are the steps in which I took part:
1.
Defining an Architecture
Defining the development standards, the global processes, the file organization, the naming conventions, etc.
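The kind of standardized file organization this produced can be sketched as follows (the directory names are hypothetical examples, not the actual conventions we defined):

```shell
#!/usr/bin/env bash
# Sketch of a standardized project layout (directory names are
# hypothetical examples, one sub-tree per processing stage).
PROJECT_ROOT="${PROJECT_ROOT:-/tmp/bigdata_platform}"

for dir in conf scripts/collection scripts/ingestion \
           scripts/transformation scripts/exposition logs doc; do
    mkdir -p "${PROJECT_ROOT}/${dir}"
done

ls "${PROJECT_ROOT}/scripts"
```

With every module following the same layout, scripts and deployments can locate configuration and logs without per-project special cases.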
2.
Upskilling
Learning how to use the different Big Data tools needed (Spark, Hive, Nifi, Sqoop, Oozie, Hue, Ambari, etc.), and sharing best practices.
3.
Building a Framework
Building a Bash Shell framework to make all the needed tools easy to use.
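A minimal sketch of the kind of wrapper such a framework provides (all names here are hypothetical, not the actual framework): consistent logging plus a step runner that any Hive, Sqoop, or Spark command can be passed through.

```shell
#!/usr/bin/env bash
# Minimal sketch of a shell framework wrapper (all names hypothetical).
set -u

LOG_FILE="${LOG_FILE:-/tmp/bigdata_framework.log}"

log() {
    # Timestamped, leveled log line, written to stdout and the log file.
    local level="$1"; shift
    echo "$(date '+%Y-%m-%d %H:%M:%S') [${level}] $*" | tee -a "$LOG_FILE"
}

run_step() {
    # Run a named step, log its start and outcome, and stop on failure.
    local name="$1"; shift
    log INFO "Starting step: ${name}"
    if "$@"; then
        log INFO "Step succeeded: ${name}"
    else
        local rc=$?
        log ERROR "Step failed: ${name} (exit code ${rc})"
        return "$rc"
    fi
}

# Example: wrap any command (here a placeholder instead of hive/sqoop).
run_step "hello" echo "hello from the framework"
```

Centralizing logging and error handling this way means each data module only declares what to run, not how to log or fail.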
4.
Data collection
Automatically collecting data from multiple sources with Sqoop and Nifi.
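For the Sqoop side, collections boil down to launching import commands per source table. A dry-run sketch (the JDBC connection string, credentials, table, and paths are hypothetical; the command is printed, not executed):

```shell
#!/usr/bin/env bash
# Build a Sqoop import command for one source table (dry run: the
# command is printed, not executed; connection details hypothetical).
build_sqoop_import() {
    local table="$1" target_dir="$2"
    echo "sqoop import" \
         "--connect jdbc:oracle:thin:@//db-host:1521/B2BDB" \
         "--username etl_user --password-file /user/etl/.pwd" \
         "--table ${table}" \
         "--target-dir ${target_dir}" \
         "--num-mappers 4" \
         "--as-parquetfile"
}

build_sqoop_import CONTRACTS /data/raw/contracts
```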
5.
Data Ingestion
Ensuring the quality of the collected data and storing it in Operational Data Stores (ODS) with Hive (SQL-like).
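An ingestion step of this kind can be sketched as a script that generates the HiveQL loading a raw source into a partitioned ODS table, filtering out rows that fail basic quality checks (table, column, and database names are hypothetical; the HQL is printed here instead of being passed to `hive -f`):

```shell
#!/usr/bin/env bash
# Sketch of an ingestion step: generate the HiveQL that loads raw data
# into a partitioned ODS table (all names hypothetical). Dry run: the
# HQL is printed instead of being executed by Hive.
make_ods_load_hql() {
    local ds="$1"   # ingestion date, e.g. 2018-01-31
    cat <<EOF
CREATE TABLE IF NOT EXISTS ods.contracts (
    contract_id STRING,
    client_id   STRING,
    start_date  DATE,
    amount      DECIMAL(12,2)
)
PARTITIONED BY (ingestion_date STRING)
STORED AS ORC;

-- Keep only rows passing basic quality checks before loading the ODS.
INSERT OVERWRITE TABLE ods.contracts PARTITION (ingestion_date = '${ds}')
SELECT contract_id, client_id, start_date, amount
FROM raw.contracts
WHERE contract_id IS NOT NULL
  AND amount >= 0;
EOF
}

make_ods_load_hql "2018-01-31"
```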
6.
Data Transformation
Cross-referencing different sources and building the desired KPIs following the rules defined with business teams with Hive or Spark (Scala).
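On the Hive side, such a transformation cross-references ODS tables to materialize a KPI. A sketch with hypothetical tables and a hypothetical KPI (calls per client per month), again printed as a dry run:

```shell
#!/usr/bin/env bash
# Sketch of a transformation step: HiveQL cross-referencing two ODS
# sources to build a business KPI (names and rules hypothetical).
make_kpi_hql() {
    local month="$1"  # e.g. 2018-01
    cat <<EOF
INSERT OVERWRITE TABLE kpi.calls_per_client PARTITION (month = '${month}')
SELECT c.client_id,
       COUNT(cl.call_id)        AS nb_calls,
       AVG(cl.duration_seconds) AS avg_duration_s
FROM ods.clients c
LEFT JOIN ods.call_logs cl
  ON cl.client_id = c.client_id
 AND cl.call_month = '${month}'
GROUP BY c.client_id;
EOF
}

make_kpi_hql "2018-01"
```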
7.
Data Exposition
Allowing each application or user to access only the data it needs with Hive or Spark.
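With Hive, one way this per-consumer access works is through dedicated views and grants: each application only sees the view built for it. A dry-run sketch (view, schema, and role names are hypothetical):

```shell
#!/usr/bin/env bash
# Sketch of an exposition step: a consuming application is granted
# access to a dedicated Hive view and nothing else (names and roles
# hypothetical). Dry run: the HQL is printed, not executed.
make_exposition_hql() {
    local app_role="$1"
    cat <<EOF
CREATE VIEW IF NOT EXISTS expo.calls_overview AS
SELECT client_id, nb_calls, avg_duration_s
FROM kpi.calls_per_client;

GRANT SELECT ON TABLE expo.calls_overview TO ROLE ${app_role};
EOF
}

make_exposition_hql callcenter_app
```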
8.
Data Visualisation
Visualizing data with tools like Qlik Sense or Plotly.
9.
Defining deployment steps
For each module, coding and documenting the deployment steps for installation or update in production.
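Coded deployment steps of this kind can be sketched as a script where every step is both documented and executable, with a dry-run mode to review what would happen before touching production (module names and paths are hypothetical):

```shell
#!/usr/bin/env bash
# Sketch of coded deployment steps for one module (paths hypothetical).
# DRY_RUN=1 (the default here) prints each step instead of running it,
# so the deployment can be reviewed before production.
DRY_RUN="${DRY_RUN:-1}"

step() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "DRY RUN: $*"
    else
        "$@"
    fi
}

deploy_module() {
    local module="$1" version="$2"
    step mkdir -p "/opt/bigdata/${module}/${version}"
    step cp -r "build/${module}/." "/opt/bigdata/${module}/${version}/"
    step ln -sfn "/opt/bigdata/${module}/${version}" "/opt/bigdata/${module}/current"
    step hdfs dfs -mkdir -p "/apps/${module}"
}

deploy_module ingestion 1.2.0
```

Keeping the steps as code rather than prose means the documentation can never drift from what is actually run.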
10.
Automating
For each task from the previous steps that became repetitive, coding an automated way to perform it.
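A typical example of such automation: once several tables follow the same collection pattern, one job per table can be generated from a single list instead of being written by hand (table names and paths are hypothetical):

```shell
#!/usr/bin/env bash
# Sketch of automating a repetitive task: generating one collection
# command per source table from a single list (names hypothetical).
TABLES="CLIENTS CONTRACTS CALL_LOGS"

generate_collection_jobs() {
    local table dir
    for table in $TABLES; do
        # Lower-cased HDFS directory derived from the table name.
        dir=$(echo "$table" | tr '[:upper:]' '[:lower:]')
        echo "sqoop import --table ${table} --target-dir /data/raw/${dir}"
    done
}

generate_collection_jobs
```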
11.
Documenting and Transferring Knowledge
For each module, writing enough documentation that a newcomer can understand how it works on their own. Ensuring that everyone knows how everything works, so the team can keep working even in case of absence or resignation.
Results
One year later, more than ten people had joined the team. We had gathered data from many sources and were transforming and exposing it for two major projects in an automated, industrial way.
The first project provided call center managers with a tool to better monitor their activity and improve client service.
The second consisted of building tools to better explore and understand the activity of clients on various digital channels, and to improve customer experience.
Finally, large and less structured data was made newly available for future use, and the platform would allow future Machine Learning projects to be developed easily.
Tools & methods:
Hadoop, Spark, Hive, Nifi, Sqoop, Oozie, Hue, Ambari
SQL, Scala, Bash Shell, Linux (command line), Git
Agile methodology, Knowledge Management
Interested in cooperating, or would like to discuss anything?