Women in Big Data – East
Quarter 1 Newsletter
February 15th, 2018
Hello WiBD community! Hope your New Year is off to a great start. WiBD East wrapped up 2017 with a 1200-member strong community across Washington DC, New York City, Boston and Pittsburgh. We are looking forward to an exciting 2018 with the launch of a chapter in Atlanta and many opportunities to learn and network.
Application of Analytics in the field of Natural and Technological Disasters
Date: March 7th, 2018
Time: 6:00 PM
Where: McLean, VA
Maybe you’ve seen the 23andMe commercial featuring the young woman exploring her heritage around the world, or maybe you know someone who learned about a genetic predisposition or found a family relation from the service. Maybe you’ve even used one of the kits yourself, swabbing your cheek and wondering what answers your saliva holds.
But what kind of implications does direct-to-consumer genetic testing have in the realm of big data? Each swab generates data on an entire genome, and when these companies aggregate across millions of customers, they create a gold mine of genetic data. As someone who has always been interested in the intersection of bioethics and technology, I often find myself pondering both the vast research potential and the serious ethical concerns associated with these products.
Certainly, direct-to-consumer genetic testing companies can have a substantial impact in the field of medicine. According to 23andMe’s head of privacy, most of the company’s consumers consent to make their data (anonymized and in aggregate) available for research. Consequently, researchers who partner with the company can investigate the genetic basis of various conditions and possibly even develop diagnostic tools or gene therapies. For such endeavors as well as genealogical ones, these products provide a diverse and valuable source of data.
Yet there is a flip side as well, related to the privacy concerns that we face more and more when discussing big data. For example, what if one of these companies got hacked and exposed sensitive medical information about customers or their families? If several of your relatives get tested, how much genetic information can be gleaned about you? Several researchers have also questioned how much anonymity these companies can ensure and have demonstrated the ability to identify the origin of “anonymous” genetic data.
Several approaches can work in tandem to mitigate these concerns and promote medical and genealogical discoveries while maintaining proper consumer protections. From a technology standpoint, direct-to-consumer genetic testing companies need to commit to continued bolstering of their cyber and data security practices to prevent hacks. From a policy standpoint, it is important to support and strengthen legislation like the Genetic Information Nondiscrimination Act to ensure that employers and insurance providers do not use people’s genomes against them. Additionally, companies like 23andMe ought to expand their informed consent practices to ensure that consumers understand that their genomes could possibly be traced back to them, and the potential negative consequences should their data fall into the wrong hands. Such measures would allow researchers to make use of the big data these companies generate while limiting harmful outcomes.
The Women In Big Data West Coast/Bay Area chapter kicked off 2018 on February 8 with our very first workshop: “Speaking Confidently: Ways to Influence Relationships and Career Growth.”
The event was sponsored by Intel and organized by Women in Big Data Forum leadership team members Yulia Tell and Stella Mashkevitch.
Our “WiBD instructor of the year” in 2017 – Bob Loftis, NowForward Coaching – taught this very useful soft skills workshop, which was attended by about 50 enthusiastic people who were eager to learn and participate in exercises.
Bob explained how a comfortable physical stance can prepare us for alert and assertive communication and put us more at ease. Often we are lost in thought and subject to distractions that undermine our persuasiveness. “Now” is the only time we have. When fully present in communication, we “show up” authentically. Here are some key aspects of an agile approach to job search challenges:
There is no failure, only feedback.
Each experience brings me closer to what I want.
What must I do to show my brand, values, and successes?
How can I improve their (our) business impact?
Bob also explained how to make sure that you have a good dialogue. In good dialogue, we see interested expressions, focused attention, no over-talking, moderate and energized tones of voice, many people contributing, good, powerful questions, direct responses, no blaming, lots of facts, a desire to reach agreement, clarifications, and summarizing. To practice having a good dialogue, we role-played a new employee’s one-to-one discussion with a manager. We formed pairs and practiced the discussion about performance progress and steps to take for a promotion in a six-month period.
Another important topic of discussion was stress. We shared how each of us experiences stress, how to recognize the impact of stress, and how to cope with it.
Then we moved on to the importance of being mindful in all our communications. I want to finish with Mindful Communication Guidelines from Bob:
Our truth (reality) is our perception.
We can find a common purpose.
We can learn from any situation.
It was a fantastic workshop. A huge thank you to Bob Loftis and all the participants who made it possible.
Big data has made a name for itself in the media recently as an exciting new tool that companies can use to boost their profits and build their brand name. The vast amounts of data generated through mobile devices, social media pages, geolocators and more all give companies invaluable information that can be used to boost their bottom line, but there are also less obvious benefits to collecting and analyzing big data.
Armed with the right information, companies can create successful customer loyalty programs. As most business owners know, it’s much cheaper to retain an existing customer than to attract a new one. Big data gives companies the information that they need to track a customer’s moves, predict their future behavior, and act accordingly to keep the customer satisfied. Here are some tips on how your company can use big data to improve customer loyalty.
Get Instant Feedback
When using data from online transactions, social media accounts, and more, companies are able to get feedback from their customers in real time. They can track where a customer goes, what they buy, and what they thought about their experience to get ideas about how to improve a product or service.
Big data makes it easier to predict whether a customer is experiencing any difficulties with their purchase. If a customer is dissatisfied, the company’s customer service department can reach out and address the issue. Not only does this make it faster to reach an acceptable solution, but it also prevents the customer from getting upset or irate. They’ll be less inclined to leave negative online reviews and more likely to return as a client.
Take a Personalized Approach
Big data allows companies to get to know important details about a person, such as their age, location, gender, and personal preferences. Communications and special offers can be targeted towards the customers who are most likely to accept. This saves companies both time and money, and customers are less likely to be annoyed by promotional emails and letters. Businesses can send out coupons based on a customer’s proximity to a store, for example, or they can create offers on products a client has mentioned on social media. This is a powerful tool that can be used to increase sales both in-store and online.
Improve the Customer Experience
A satisfied customer is a loyal customer, and big data is helping companies to ensure that people leave their establishment satisfied. Businesses can track mentions of their goods and services through social media channels, filtering words to see if there is an overall positive or negative reaction. Personalized services also help the customer to feel valued, and their needs acknowledged.
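As a rough illustration of the word-filtering idea above, the sketch below tallies positive and negative words across a few made-up social media mentions. The word lists, the sample mentions, and the function name `mention_sentiment` are all assumptions for illustration; a real system would likely use a much richer lexicon or a trained model.

```python
import re

# Hypothetical word lists; a production lexicon would be much larger.
POSITIVE = {"love", "great", "fast", "helpful"}
NEGATIVE = {"slow", "broken", "rude", "refund"}


def mention_sentiment(mentions):
    """Return (positive_hits, negative_hits, overall) for a list of mentions."""
    pos = neg = 0
    for text in mentions:
        # Lowercase and split into simple word tokens.
        words = re.findall(r"[a-z']+", text.lower())
        pos += sum(w in POSITIVE for w in words)
        neg += sum(w in NEGATIVE for w in words)
    if pos > neg:
        overall = "positive"
    elif neg > pos:
        overall = "negative"
    else:
        overall = "neutral"
    return pos, neg, overall


# Made-up mentions for demonstration only.
sample = [
    "Love the new app, checkout is so fast!",
    "Support was rude and my order arrived broken.",
    "Great service, very helpful staff.",
]
print(mention_sentiment(sample))
```

Counting keyword hits is the simplest possible approach; it ignores negation and sarcasm, so it is best read as a starting point rather than a finished analysis.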
Companies that collect big data can incorporate what they learn into customer service training programs to help representatives better assist customers. It’s important that they remain aware of both overall trends and the individual needs of customers who deal with the company.
In order to improve customer retention, companies can collect and analyze big data. Big data not only helps companies boost their profits, but also helps them retain loyal customers. By carefully analyzing information that’s collected, businesses can work to improve their goods and services and continue impressing their customers year after year.
On July 11, 2017, the second Women in Big Data Meetup in Paris, France, took place at the SAP Leonardo Center. The topics of the day were Machine Learning and Predictive Analysis, and we were also treated to an overview of the newly opened Leonardo Center facility.
Kicking off the event, Tina Rosario, Director of the Women in Big Data EMEA region, gave a few welcoming words and invited everyone to share in aperitifs and conversation. Shortly thereafter, we adjourned to the 20th floor conference room, where Nathalie Dietz, Co-Innovation LAB Consultant, delivered a presentation about design thinking and about the solutions the Leonardo product delivers for problems in data management and analytics. Click here and here to learn more about Leonardo, and click here to take a look at a free online SAP Build tool.
Following the presentation was a Q&A session that organically evolved into a discussion on diversity and the challenges women in technology face. Specifics included:
Diversity hiring practices. Remy Chambard-Williams, Consultant SAP PPM at SAP France, elaborated on the tendencies to hire people who are either similar to you or to the team member who is leaving. Both practices are common and obstruct diversity hiring.
Low percentage of women on boards of tech companies.
The upside of the industry: passionate people work long hours–including parents of both genders; many companies have programs in place to address such challenges.
Tina Rosario wrapped up with an overview presentation on WiBD, then took questions:
Question: Why women? Why Big Data? How do you work with similar groups (women in tech, women in AI)? Answer: First, Big Data and analytics is a growing sector, and there is an increasing shortage of talent (ranging from very technical to managerial roles). Women constitute 50% of the population and should not be excluded from these career opportunities. Secondly, women tend to do really well in career tracks that have positive social impact, and the Big Data field provides multiple opportunities of this kind, as well as many roles where human judgement and not losing the big picture are crucial skills.
Question: Do you collaborate with other meetups in Paris dedicated to Big Data and analytics fields? Answer: We are all working towards the same good cause of retaining and helping women grow in the tech sector. We already are in sync with other groups in Europe, and we are open to establishing new connections.
Question: What are WiBD EMEA objectives for 2018? Answer: There are several: We want to create a solid webinar/technical training roadmap, continue expanding our presence across Europe, and build a strong community. We’re also committed to understanding how we can best meet the specific needs of women in Europe, and adjusting our activities accordingly.
Question: What types of events are you organizing? What’s the ratio between face-to-face (meetup) and virtual (webinar) events? Answer: Local events are driven by the chapters, while the online webinar roadmap covers all of EMEA. The ratio varies depending on the specific location and how active the local community is.
Question: What are the elements of corporate culture that make a company a great place to work for women? Answer: An open dialogue about diversity, as well as a formal program aimed at addressing the unique challenges women face.
Question: How can we participate? Answer: Follow us on MeetUp, register as a member on our site, join our LinkedIn group, and check our global Twitter updates. Questions? Send an email to firstname.lastname@example.org.
Overall, about 30 people attended, and we had a diverse group in terms of age and experience levels. We were happy to see a few men, and one of them, Peter Doherty, highlighted this meetup and WiBD community on his blog, adding a few thoughts on gender equity and men’s role in promoting diversity. As a rule, WiBD events are open to men who care about diversity as well as about technical topics, and we are open to male advocates delivering training, keynotes or joining the team.
Many thanks to all who attended, and we hope to see you next time!
Women in Big Data hosted “Empowering Big Data in Hybrid Cloud”, a networking event sponsored by Nutanix on November 2, 2017, featuring Wendy Pfeiffer, CIO of Nutanix, Sudheesh Nair, President of Nutanix, and Usha Upadhyayula, Software Engineer at Intel and Women in Big Data member. Over fifty people attended this very well received event.
The event was kicked off by Usha giving an overview of Women in Big Data, followed by Wendy sharing her amazing experience in Big Data while working at Yahoo and now at Nutanix. Wendy said that one of the challenges of Big Data is personal data security. As a CIO, she emphasized the importance of security when working with personal information and lessons she learned while working with personal information at Yahoo. She also talked about how technology is changing rapidly and how, for a successful career, we need to love to learn, not be afraid to try and fail, choose paths wisely, and follow them to completion.
Next, Sudheesh Nair talked about how businesses are moving towards enabling a better consumer experience with Big Data, while making infrastructure transparent and delivering more value. Compute has become distributed (moved to where data is being generated), and the infrastructure, which has become very complex in order to handle massive amounts of data, needs to be managed properly. A true web-scale architecture is important in dealing with these complexities in infrastructure to make big data consumption seamless. Nutanix is building software and services on AI-based architecture and exposing APIs to completely automate and simplify how the infrastructure is managed, bringing together public and private cloud.
Sudheesh gave his Big Data/Machine Learning presentation again at .NEXT, Nutanix’s end user event, on November 9th in Nice, France. You can watch his presentation at minute 53:19 of the “Day 2 Morning” video. Sudheesh was followed by Maryam Sanglaji, who took us through a demo that ran the Datameer application on a Hadoop cluster deployed on a Nutanix system. She showed Prism (the Nutanix central management GUI) and how the workload affects the infrastructure (CPU, Memory, Storage utilization). As with many other enterprise applications, Nutanix software powers Big Data applications and supports them with its scale-out architecture. Maryam also shared her personal journey and experiences within engineering and product management organizations.
We also had a chance to hear from some of the women behind Nutanix’s big data projects. We heard from Christie Lee (Systems Engineer), Neha Agrawal (Internal Data Analytics) and Preethi Mohan (Data Scientist)–three extraordinary women who shared their stories of how they got to Nutanix and how their careers in Big Data have flourished.
Attendees also had an opportunity to meet with the recruiters at Nutanix after the event.
Women in Big Data would like to thank Michele Taylor-Smith, who helped organize the event at Nutanix, and to send a big “Thank You” to Wendy Pfeiffer, Sudheesh Nair and the panelists. Overall, it was an amazing evening with so many from the community network joining together to learn from each other.
On Wednesday, November 15, 2017, more than 60 members and guests of the Women In Big Data Forum met at WalmartLabs in Sunnyvale, California, for an introduction to the Data Visualization field and a discussion about trends and entry points into the field. WalmartLabs generously provided us with excellent facilities and food for the event.
Women in Big Data Forum leadership team members Yulia Tell, Intel Corporation, and Stella Mashkevitch kicked off the event by welcoming participants and highlighting the many opportunities involvement in WiBD affords them, from joining a committee…to running a training…to sponsoring an event.
Vladimir Kroz, WalmartLabs, highlighted WalmartLabs’ cutting-edge research and development groups and its many current openings and career opportunities.
The next presenter was Partha Padmanabhan, an instructor on Data Visualization and Data Modeling at the University of California at Santa Clara extension in the evening and a Senior Data and Solutions Architect at Cisco by day. With more than 20 years working in Data Modeling and Relational Database concepts, Partha has a great deal of experience in Big Data and the In-Memory database. His expertise includes creating Logical/Physical and Dimensional Data models for Transaction applications, Enterprise Data Warehouse systems, designing Big Data solutions, Analytical Data Modeling, and enabling Business Intelligence Solutions using Tableau for Dashboards and Data Visualization needs.
Partha started his presentation with a definition of terms: What is data visualization? Where can elements of data visualization, such as dashboards, be found? He then went into details about different types of Data Visualization (exploration and explanation) and the importance of taking into account the audience’s expectations. Despite great advances in the area of Data Visualization, technological and even just human abilities to comprehend information remain pain points.
Next, Partha went over major players in the field and gave an overview of their products currently on the market. He also showed examples of good and bad data visualizations, and touched briefly on the trends in Data Visualization development, finishing with an overview of entry points for software professionals into Data Visualization.
Prompted by questions from the audience, Partha touched on forecasting visualization and choosing the best tool for specific visualization needs.
After two months of preparation, the Women in Big Data Southern California chapter officially kicked off on Nov 9th, 2017 on the University of Southern California campus. Over 100 men and women with diverse backgrounds were in attendance.
The check-in line was out of the door
It was an honor to host Netflix as we celebrated the beginning of a new chapter. The theme of the event was Engineering and Analytics from Pitch to Play. The event started off with an opening speech given by the founder of the chapter, April Zeng, who introduced the mission of Women in Big Data, talked about past events other chapters have hosted, and laid out what the Southern California chapter has achieved since it was founded in August 2017.
April Zeng gives the opening speech
After the opening speech, the stage was passed to the Netflix speakers. Netflix uses data, analytics, and machine learning at every point along the journey that takes a show from its earliest stages (“a pitch”) to the final product (“the video you see when you press play”). Four incredible ladies, Becky Tucker, Jen Walraven, Tanya Mosisoglu and Monisha Kanoth, spoke about how their work facilitates that workflow and how they use data and automation to make that process smarter and easier. Additional Netflixers, Jason Flittner and Nishant Hedge, shared their industry insights with the audience. A Netflix Recruiter, Hong Nguyen, also joined us to network with potential future Netflixers.
Jen Walraven, Monisha Kanoth, Becky Tucker, Tanya Mosisoglu
Let’s learn a little more about our Netflix speakers.
Becky Tucker is a Senior Data Scientist at Netflix, and she holds a Ph.D. in Physics from Caltech. At Netflix, Becky works on models that predict the demand for TV shows and movies. She began her introduction to data science for content selection and creation by dispelling the rumor that Netflix uses machine learning to write scripts. Netflix still leaves the script writing to the creatives. Data Science does not dominate creative freedom, but it can help with choosing content by using machine learning along with decision tree and regression models. Data engineering that produces clean and organized data is the backbone of her data science work. She also introduced the idea of “Comps” to address the problem of not having enough data to greenlight all of their original series.
Jen Walraven is a Senior Analytics Engineer at Netflix on the Studio and Production Science and Analytics team and is currently working to build reporting and analytical tools to support the growth of Netflix Original productions. She shared how data analytics is used during planning, production and post-production stages–a topic that really resonated with the audience, who were inspired by the close collaboration between data science, analytics and business to drive visibility and insights.
Tanya Mosisoglu is a Senior Analytics Engineer at Netflix. She works on the data science and analytics team supporting the digital supply chain and studio production in Los Angeles. These days, she is most excited about launching a “Girls Who Code” club at Chaminade Middle School in Chatsworth, California. “With analytics skills at hand, you can pretty much go to any industry and be valuable,” said Tanya. By sharing her experience, she inspired the audience to show confidence and value in their careers. Tanya also provided information about the Netflix Preferred Vendor (NPV) program, a program made up of the vendors with the lowest re-delivery rates. Her team works with:
• Operational teams in Hollywood to understand how to improve the digital supply chain.
• Engineers to influence what data to gather.
• Stakeholders to define new KPIs.
These are what make her job challenging, yet very rewarding.
Monisha Kanoth is a Senior Data Engineer at Netflix and was one of the founding members of the Content Data Engineering & Analytics team. She previously worked as a Big Data and Data Warehouse lead. She talked about the use of data flow automation to ensure efficient and accurate reporting, and data transparency across the organization. She mentioned that skills with technologies such as Spark, Apache Pig and Python are essential. The audience was thrilled by her expertise and asked questions about how to prepare for data-related jobs. Monisha kindly made some recommendations and stressed the importance of learning new techniques in order to keep up with the constantly changing tech industry.
Panelists answer audience questions
The kickoff event concluded with a 40-minute panel discussion session. Speakers discussed the culture, working in teams, required skills, challenges and other details of their jobs. A few highlights from the questions asked include “What skills do we leverage from an MBA & Marketing program?”…“How do you fit in Netflix’s culture?” … and “How do you explain machine learning to your boss since current models are mostly black box?” Abbass Al Sharif, Director of USC Marshall Business Analytics Program, asked how to empower women in the field. “Don’t be afraid to ask for it,” Jen replied. “If you are awesome in your work and interested in leadership, speak it early and frequently.”
Jason Flittner from Netflix, WiBD SoCal Chapter Committee Members Rachel Feldman, Maha Raghavan, Hafsa Dawood, Ushita Palande, April Zeng
A few of WiBD SoCal Chapter Committee Members at the kickoff event: Rachel Feldman, Hafsa Dawood, Tania Wang, Rose Byrd, Judy He, Xinyue Cao, Sylvie Chen
Women in Big Data’s North West chapter successfully organized a meetup, “All about Machine Learning,” to educate women about the technical concepts involved in Machine Learning.
The event was held at Intel’s Hawthorne Farm campus in Hillsboro, Oregon, and was extremely well received; around 50 people attended, including four men who constructively participated in the discussions and Q&A.
Not only was the number of women who attended impressive, but I was surprised at the varied backgrounds represented: the audience ranged from leading engineers, leads and managers from Intel, to CEOs of energy, finance and HR firms who eagerly wanted to learn how Machine Learning can add to their businesses’ growth.
We had around 45 minutes of networking before the event began, during which the audience actively discussed problems in their fields and how technology and machine learning are attempting to solve them. It was amazing to witness everyone come together–and to put various aspects from their work together. We also had some women who were looking for opportunities to join the technology field after a career break, and we discussed possible opportunities with the leaders there.
Several women stepped up and asked us to collaborate with their companies to host similar events. Clearly, the community loved our effort and wanted to be a part of more such events!
The talk was kicked off by Soumya Guptha, Marketing Manager, Software and Solutions Group, Intel Corporation. A founding member of Women in Big Data, Soumya described her journey with the organization and how it aspires to reach a goal of 25% female representation in leading technologies. Her talk instantly sparked a strong conviction in the audience to do their bit towards the cause.
After that, Dr. Meena Arunachalam, Principal Engineer, Data Center Group, Intel Corporation, gave an exemplary talk on Machine Learning. Although ML is a vast topic, Meena expertly broke it into explicit smaller topics, tracing the journey of Machine Learning from the early 1960s to the state of the technology today. She presented simple and lucid examples from daily life to explain complicated concepts like neurons, neural networks, feed forward networks and back propagation–all features that form the basis of the technology. She also discussed how varied industries such as energy, oil and gas, healthcare, and government are leveraging machine learning to accelerate their processes. As a senior leader in the field, Meena emphasized that adapting and moving quickly as the technology evolves can take us a long way in our careers, and she guided the audience on the steps and learning opportunities that can help them get into this industry. Especially enjoyable was Meena’s explanation of Deep Learning, which holistically covered numerous points on how to create an efficient Deep Learning model. Click here to download Meena’s slides.
Finally, Kripa Sankaranarayanan, who works in the space of Artificial Intelligence at Intel Corporation, showed the audience a demo on creating a machine learning model using TensorFlow and Intel’s Neon. She walked through simple steps on installing TensorFlow and instructed attendees on how they can start making models right on their laptops! The audience was thrilled to have seen the demo of the model and looked forward to building their own models!
Many thanks to Meena and Kripa, who put in a lot of effort to guide us through a complicated but extremely beneficial topic. Thanks also to our amazing photographer Leo Aqrabawi, who is a Platform Architect with Intel Corporation. He volunteered to bring us all these beautiful memories from the event. Thank you!
Agata Gruza and I (Bhakti Hinduja), who work as Software Engineers in the Big Data space at Intel Corporation and are active members of Women in Big Data, cherished the experience of organizing this event with these wonderful ladies. We were delighted to see the community come together and will work towards bringing in more events in the future.
The author, Bhakti Hinduja, is a Big Data Software Development Engineer with Intel. She works on developing big data and machine learning software.
The EMEA region of the Women in Big Data Forum is growing!
The kick off of the new chapter of Women in Big Data in Dublin was officially announced on October 26th during Spark Summit Europe.
Jessica Mccarthy, Staff Research Scientist with Intel Labs (focused on the research and development of future internet of things technologies), opened the presentation of the new chapter with an overview of the Women in Big Data Forum. Click here to download Jessica’s presentation deck.
Following Jessica, Marina B. Alekseeva of Intel gave the keynote presentation about diversity. Marina is Vice President and Software and Services Group General Manager, Intel Russia Software and Services Group. Click here to download Marina’s presentation.
Both Jessica and Marina managed a Q&A session afterward.
This formal announcement at a large industry event is the first step on the roadmap of local events for the new WiBD chapter; watch for more news early in 2018. We are thrilled to welcome Dublin to the WiBD EMEA map, joining Russia and France. Next in line are Germany and Poland. Stay tuned for more details!
On September 26, 2017, nearly 200 data professionals gathered at LinkedIn headquarters in Sunnyvale to hear a presentation by seven of the teams that work in a massive, coordinated effort to bring us the “People You May Know” (PYMK) section of the LinkedIn web page.
Kapil Surlaker, Senior Director of Engineering at LinkedIn, began the presentation with an overview of the functional areas from the product side, data side, and infrastructure and platform space that contribute to the PYMK product. On the product side, there are product managers, developer teams and test engineers at work to create the applications for LinkedIn’s 500+ million members.
The Data infrastructure and Analytics platform space is represented by teams of Systems and Infra developers, SREs, and operations teams that all contribute to solving the enormous challenge of scaling the platforms. Finally, there are Data analysts, Data scientists, and Relevance engineers who work to make each member’s user experience on the LinkedIn web page more compelling.
Hema Raghavan, Senior Manager and Head of Machine Learning for Growth at LinkedIn, introduced us to the PYMK product. Its mission: “To connect members to the people who matter most to them professionally, enabling them to access opportunities within the LinkedIn ecosystem.” PYMK gives members a nearly effortless way to grow their networks, thereby creating more opportunities and access to industry knowledge.
Mina Doroud is a Staff Data Scientist on the Analytics team at LinkedIn. The analytics team ensures that any changes to PYMK are data driven, true to the values of LinkedIn and, most of all, create a good member experience. Higher quality connections create a better user experience. The better the user experience, the more frequently members will return to the LinkedIn site. The goal is to present each member with the highest quality PYMK candidates, so they are more likely to request a connection. Two important metrics used to study this are “Invitations sent and accepted” and “Invitations received and accepted.” In both cases, attention is given to someone who is either sending requests and not getting accepts, or is receiving invitations but not accepting many of them. This is important since every new connection could introduce 1 to 32 jobs, 12 companies and 748 people.
Heloise Logan is a Staff Software Engineer working on machine learning and recommendation systems. She discussed how the PYMK framework was built. Candidate generation is a big part of PYMK. The objective is to generate a list of candidates ranked by the probability that the member will click to send a connection request. Learning is achieved by looking at the member profile, network and activity data. Heloise then went on to describe the offline and online systems that comprise the PYMK architecture. Offline systems are time and resource intensive. A social graph is used to identify possible 2nd degree candidates, and an economic graph is explored to find candidates that are outside of the social graph. Models are applied and then all pairs are scored and ranked. When a member adds a new connection, or has explored some other aspect of the LinkedIn site before coming to the PYMK page, those data are used to generate contextual candidates that are processed by the online system. Offline and online candidates are merged, re-scored and re-ranked, and the resulting list is presented to the member on their People You May Know tab. If a member likes a candidate, they are more likely to request a connection, which in turn feeds the “Invitations sent and accepted” metric.
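To make the candidate-generation flow above more concrete, here is a minimal sketch under stated assumptions. It finds 2nd degree candidates in a toy social graph, scores them by the number of shared connections, and then merges them with a contextual "online" candidate before re-ranking. The graph data, the shared-connection score, and the function names are all hypothetical; the talk did not describe LinkedIn's actual models or scoring.

```python
from collections import Counter

# Toy social graph: member -> set of first-degree connections (made-up data).
GRAPH = {
    "ana": {"bo", "cy"},
    "bo": {"ana", "dee", "eli"},
    "cy": {"ana", "dee"},
    "dee": {"bo", "cy"},
    "eli": {"bo"},
}


def second_degree_candidates(graph, member):
    """Rank friends-of-friends by number of shared connections (assumed score)."""
    direct = graph[member]
    shared = Counter()
    for friend in direct:
        for candidate in graph[friend]:
            # Skip the member themselves and people already connected.
            if candidate != member and candidate not in direct:
                shared[candidate] += 1
    # Highest score first; break ties by name for a stable order.
    return sorted(shared.items(), key=lambda kv: (-kv[1], kv[0]))


def merge_and_rerank(offline, online):
    """Merge offline and online candidates, keeping each one's higher score."""
    best = dict(offline)
    for name, score in online:
        best[name] = max(score, best.get(name, 0))
    return sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))


offline = second_degree_candidates(GRAPH, "ana")
online = [("eli", 3)]  # hypothetical contextual candidate from recent activity
ranked = merge_and_rerank(offline, online)
print(offline)
print(ranked)
```

In this toy run the contextual signal moves "eli" ahead of "dee", which mirrors the idea that recent member activity can change the final ordering.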
The next presenter, Min Liu, a Senior Software Engineer, discussed A/B testing at LinkedIn. She described her role as being half statistician and half engineer: the statistician side develops the methodologies used in A/B testing, and the engineering side implements them. LinkedIn measures success based on how many connections users are making. A/B testing is used to measure and quantify feature impact and to improve the user experience. Analysis runs automatically for more than 2,000 metrics per experiment, and the results are also sliced by dimensions such as user segment and geolocation for more granular visibility. The “test everything” culture at LinkedIn is evident in the 300+ A/B tests that run concurrently every day. Trillions of tracking events get sent through the platform every day. Despite this scale, the first A/B report is generally available within two hours and updated hourly after that.
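As a small illustration of what quantifying feature impact can look like, the sketch below compares a conversion-style metric between a control group and a treatment group using a two-proportion z-test. The choice of test and all names are assumptions; the talk did not specify which statistical methods LinkedIn's platform uses.

```python
import math
from typing import Tuple


def two_proportion_z_test(conv_a: int, n_a: int,
                          conv_b: int, n_b: int) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in rates B minus A.

    conv_* are success counts (for example, invitations accepted) and
    n_* are the group sizes. Uses the pooled-variance normal approximation.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("group sizes must be positive")

    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; rates are all 0 or all 1")

    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

Slicing by user segment or geolocation would amount to running the same comparison on each subgroup, with the usual caution that many comparisons across thousands of metrics raise the chance of false positives.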
Navina Ramesh is a Senior Software Engineer on the data infrastructure team, which is responsible for the collection, processing and accessibility of data. That data must be processed as soon as it arrives and also made available to analysts for offline processing. The data flows through the infrastructure as follows: over ten thousand tracking events (page views, clicks, etc.) and metrics (behavior after exposure to an A/B test) are generated per second. They need to be gathered from points around the globe and transported with low latency and high reliability. Events must then be processed with high reliability and fault tolerance, and seamlessly moved from offline processing systems to real-time online services where they must be easily queried.
Two key systems were developed in-house and are part of the massive infrastructure machine. Apache Kafka is an open-source stream processing platform. It is the distributed backbone that transports data throughout the system. All tracking events are ingested to Kafka. Apache Samza is a stream processing framework that was developed at LinkedIn. Data comes into Samza from Kafka and other sources. Over 220 applications rely on Samza, and most of those require stateful processing.
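For readers new to the term, "stateful" stream processing means the processor remembers something between events, such as a running count per member. The sketch below shows the idea in plain Python with an in-memory dictionary. It deliberately does not use the Kafka or Samza APIs, and the event shape is an assumption; a real Samza job would read from Kafka and keep its state in a fault-tolerant store.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical event shape: (member_id, event_type), e.g. ("m1", "page_view").
Event = Tuple[str, str]


def count_events(stream: Iterable[Event]) -> Iterator[Tuple[str, str, int]]:
    """Consume events one at a time and emit a running count per key.

    The `state` dictionary is what makes this processor stateful: each
    output depends on all earlier events with the same key.
    """
    state: Dict[Event, int] = defaultdict(int)
    for member_id, event_type in stream:
        state[(member_id, event_type)] += 1
        yield member_id, event_type, state[(member_id, event_type)]
```

Feeding it two clicks from member `m1` with one click from `m2` in between yields counts of 1, 1 and 2, in arrival order.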
Suja Viswesan, Senior Engineering Manager for the big data platform, discussed how analytics are done at LinkedIn. She began by sharing the history of the People You May Know product. Nine years ago, LinkedIn had 40-50 million members. It took six months to do the computations necessary for PYMK. They were able to scrape together the resources to purchase a twelve-node Hadoop cluster that enabled them to run the same computations in two weeks.
Gobblin is the Apache platform that ingests all data sources, stores them in HDFS, and makes them available for analytics through an abstraction layer called Dali, a logical data access layer that enables a seamless user experience, no matter what changes are being made under the hood. Three additional systems have been developed in-house to assist in the management of the 2.3 trillion messages that are processed by the LinkedIn machine on a daily basis.
Cubert is a computational engine developed in-house and used along with Spark, Hive, Pig and Presto. Azkaban is the LinkedIn-developed, open-source workflow management engine that keeps all of the plates spinning. And finally, Dr. Elephant provides visibility into how to tune 200,000+ jobs that run daily in the Big Data ecosystem. This tool helps LinkedIn scale people and systems.
The final presenters for the evening were Sandhya Ramu, Director of Big Data SREs, and Savitha Ravikrishnan, Senior SRE. They co-presented on the role of site reliability engineering (SRE) at LinkedIn.
More than 200,000 jobs run on 10+ clusters every day. As the member base continues to grow, there are core principles that must be in place. The number one priority that guides all other decisions is that the LinkedIn site must be up and running. In support of that priority, developers are given self-serve tools with guard rails so that they can keep working at a fast pace. Any issues must be isolated and debugged as quickly as possible.
Savitha described the competencies needed to keep the Hadoop infrastructure healthy. Break/fix automation ensures that the more than 9,000 hosts are healthy, and it reports any issues. The configuration management system helps manage the configurations of HDFS, YARN and MapReduce. Software is upgraded and restarted in a rolling fashion so the site never has to be taken down. Monitoring checkpoints are in place for all functionalities 24×7.
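A rolling restart can be sketched as restarting a small batch of hosts, checking their health, and only then moving on. The function below is an illustrative outline under that assumption; `restart` and `is_healthy` are hypothetical callbacks, and the talk did not describe the tooling LinkedIn actually uses.

```python
from typing import Callable, List, Sequence


def rolling_restart(hosts: Sequence[str],
                    restart: Callable[[str], None],
                    is_healthy: Callable[[str], bool],
                    batch_size: int = 2) -> List[str]:
    """Restart hosts a batch at a time, halting if a batch comes back unhealthy.

    Returns the hosts that were restarted and passed the health check.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    completed: List[str] = []
    for start in range(0, len(hosts), batch_size):
        batch = list(hosts[start:start + batch_size])
        for host in batch:
            restart(host)

        unhealthy = [host for host in batch if not is_healthy(host)]
        if unhealthy:
            # Stop here so the remaining hosts keep serving traffic.
            raise RuntimeError(f"unhealthy after restart: {unhealthy}")
        completed.extend(batch)
    return completed
```

Because only one batch is down at any moment, the rest of the cluster keeps serving while the upgrade proceeds, which is the point of doing it in a rolling fashion.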
Security is taken very seriously at LinkedIn. New user onboarding and access revocation are managed through the Lightweight Directory Access Protocol (LDAP), and anyone accessing Hadoop (or any of the services on its stack) must be authenticated via Kerberos.
The evening concluded with a Q&A session with all presenters.
Women in Big Data would like to thank LinkedIn and all of the presenters and organizers for hosting this amazing evening. LinkedIn is a model of excellence that we can all learn from.